Abstract

We study universal compression of sequences generated by monotonic distributions. We
show that for a monotonic distribution over an alphabet of size $k$, each probability parameter
costs essentially $0.5 \log(n/k^3)$ bits, where $n$ is the coded sequence length, as long as $k = o(n^{1/3})$.
Otherwise, for $k = O(n)$, the total average sequence redundancy is $O(n^{1/3+\varepsilon})$ bits overall. We
then show that there exists a sub-class of monotonic distributions over infinite alphabets for
which redundancy of $O(n^{1/3+\varepsilon})$ bits overall is still achievable. This class contains fast decaying
distributions, including many distributions over the integers and geometric distributions. For
some slower decays, including other distributions over the integers, redundancy of $o(n)$ bits
overall is achievable, where a method to compute specific redundancy rates for such distributions
is derived. The results are specifically true for finite entropy monotonic distributions. Finally, we
study individual sequence redundancy behavior assuming a sequence is governed by a monotonic
distribution. We show that for sequences whose empirical distributions are monotonic, individual
redundancy bounds similar to those in the average case can be obtained. However, even if
the monotonicity in the empirical distribution is violated, diminishing per-symbol individual
sequence redundancies with respect to the monotonic maximum likelihood description length
may still be achievable.

Index Terms: monotonic distributions, universal compression, average redundancy, individual
redundancy, large alphabets, patterns.

1 Introduction



The classical setting of the universal lossless compression problem [5], [8], [9] assumes that a
sequence $x^n$ of length $n$ that was generated by a source $\boldsymbol{\theta}$ is to be compressed without knowledge
of the particular $\boldsymbol{\theta}$ that generated $x^n$ but with knowledge of the class $\Lambda$ of all possible sources $\boldsymbol{\theta}$.
The average performance of any given code that assigns a length function $L(\cdot)$ is judged on the
basis of the redundancy function $R_n(L, \boldsymbol{\theta})$, which is defined as the difference between the expected
code length of $L(\cdot)$ with respect to (w.r.t.) the given source probability mass function $P_{\boldsymbol{\theta}}$ and the
$n$th-order entropy of $P_{\boldsymbol{\theta}}$, normalized by the length $n$ of the uncoded sequence. A class of sources
is said to be universally compressible in some worst sense if the redundancy function diminishes
for this worst setting. Another approach to universal coding [29] considers the individual sequence
redundancy $\hat{R}_n(L, x^n)$, defined as the normalized difference between the code length obtained by
$L(\cdot)$ for $x^n$ and the negative logarithm of the maximum likelihood (ML) probability of the sequence
$x^n$, where the ML probability is within the class $\Lambda$. We thereafter refer to this negative logarithm as
the ML description length of $x^n$. The individual sequence redundancy is defined for each sequence
that can be generated by a source in the given class $\Lambda$.

Classical literature on universal compression [5], [8], [9], [23], [29] considered compression of
sequences generated by sources over finite alphabets. In fact, it was shown by Kieffer [15] (see also
[13]) that there are no universal codes (in the sense of diminishing redundancy) for sources over
infinite alphabets. Later work (see, e.g., [21], [25]), however, bounded the achievable redundancies
for identically and independently distributed (i.i.d.) sequences generated by sources over large and
infinite alphabets. Specifically, while it was shown that the redundancy does not decay if the
alphabet size is of the same order of magnitude as the sequence length $n$ or greater, it was also
shown that the redundancy does decay for alphabets of size $o(n)$.¹

¹ For two functions $f(n)$ and $g(n)$, $f(n) = o(g(n))$ if $\forall c$, $\exists n_0$, such that, $\forall n > n_0$, $f(n) < c\,g(n)$;
$f(n) = O(g(n))$ if $\exists c, n_0$, such that, $\forall n > n_0$, $0 \le f(n) \le c\,g(n)$; $f(n) = \Theta(g(n))$ if $\exists c_1, c_2, n_0$,
such that, $\forall n > n_0$, $c_1 g(n) \le f(n) \le c_2 g(n)$.

While there is no universal code for infinite alphabets, recent work [20] demonstrated that if
one considers the pattern of a sequence instead of the sequence itself, universal codes do exist in
the sense of diminishing redundancy. A pattern of a sequence, first considered, to the best of our
knowledge, in [1], is a sequence of indices, where the index $\psi_i$ at time $i$ represents the order of first
occurrence of letter $x_i$ in the sequence $x^n$. Further study of universal compression of patterns [20],
[21], [26], [28] provided various lower and upper bounds to various forms of redundancy in universal
compression of patterns. Another related study is that of compression of data, where the order of
the occurring data symbols is not important, but their types and empirical counts are [30]-[31].
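As a small illustration of the pattern definition above, the following Python sketch replaces each symbol by the order of its first occurrence. The function name `pattern` is chosen here for illustration only and is not notation from [1] or [20].

```python
def pattern(sequence):
    """Return the pattern of a sequence: each symbol is replaced by the
    order (1, 2, 3, ...) in which it first occurred."""
    first_occurrence = {}
    indices = []
    for symbol in sequence:
        if symbol not in first_occurrence:
            # A new symbol receives the next unused index.
            first_occurrence[symbol] = len(first_occurrence) + 1
        indices.append(first_occurrence[symbol])
    return indices
```

For instance, `pattern("abracadabra")` maps `a`, `b`, `r`, `c`, `d` to the indices 1 through 5 in that order, so the pattern depends only on the repetition structure of the sequence and not on the alphabet.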

This paper considers universal compression of data sequences generated by distributions that 
are known a-priori to be monotonic. Hence, the order of probabilities of the source symbols is 
known in advance to both encoder and decoder and can be utilized as side information to improve 
universal compression performance. Monotonic distributions are common for distributions over 
the integers, including the geometric distribution and others. Such distributions do occur in image 
compression problems (see, e.g., [18], [19]) and in other applications that compress residual signals.
A specific application one can consider for the results in this paper is compression of the list of 
last or first names in a given city of a given population. One can usually find some monotonicity 
for such a distribution in the given population, which both encoder and decoder may be aware of 
a-priori. For example, the last name "Smith" can be expected to be much more common than 
the last name "Shannon". Another example is the compression of a sequence of observations of 
different species, where one has prior knowledge which species are more common, and which are 
rare. Finally, one can consider compressing data for which side information given to the decoder 
through a different channel gives the monotonicity order. 

Unlike the case of compression of patterns, Foster, Stine, and Wyner showed in [10] that there are no
universal block codes in the standard sense for the complete class of monotonic distributions. The 
main reason is that there exist such distributions, for which much of the statistical weight lies in 
symbols that have very low probability, and most of which will not occur in a given sequence. 
Thus, in practice, even though one has the prior knowledge of the monotonicity of the distribution, 
this monotonicity is not necessarily retained in an observed sequence. Therefore, actual coding 
can be very similar to compressing with infinite alphabets, and the additional prior knowledge 
of the monotonicity is not very helpful in reducing redundancy. Despite that, Foster, Stine, and 
Wyner demonstrated codes that obtained universal per-symbol redundancy of $o(1)$ as long as the
source entropy is fixed (i.e., neither increasing with n nor infinite). However, instead of considering 
redundancy in the standard sense, the study of monotonic distributions resorted to studying relative 
redundancy, which bounds the ratio between average assigned code length and the source entropy. 
This approach dates back to work by Elias [7], Rissanen [22], and Ryabko [24].

The work in [10] studied coding sequences (or blocks) generated by i.i.d. monotonic distributions,
and designed codes for which the relative block redundancy could be (upper) bounded. Unlike that
work, the focus in [7], [22], and [24] was on designing codes that minimize the redundancy or
relative redundancy for a single symbol generated by a monotonic distribution. Specifically, in
[22], minimax codes, which minimize the relative redundancy for the worst possible monotonic
distribution over a given alphabet size, were derived. In [23], it was shown that redundancy of
$O(\log\log k)$, where $k$ is the alphabet size, can be obtained with minimax per-symbol codes. Very
recent work [16] considered per-symbol codes that minimize an average redundancy over the class
of monotonic distributions for a given alphabet size. Unlike [10], all these papers study per-symbol
codes. Therefore, the codes designed always pay non-diminishing per-symbol redundancy.

A different line of work on monotonic distributions considered optimizing codes for a known 
monotonic distribution but with unknown parameters (see [18], [19] for design of codes for two-sided 
geometric distributions). In this line of work, the class of sources is very limited and consists of 
only the unknown parameters of a known distribution. 

In this paper, we consider a general class of monotonic distributions that is not restricted 
to a specific type. We study standard block redundancy for coding sequences generated by i.i.d. 
monotonic distributions, i.e., a setting similar to the work in [10]. We do, however, restrict ourselves 
to smaller subsets of the complete class of monotonic distributions. First, we consider monotonic 
distributions over alphabets of size $k$, where $k$ is either small w.r.t. $n$, or of $O(n)$. Then, we extend
the analysis to show that under minimal restrictions of the monotonic distribution class, there exist 
universal codes in the standard sense, i.e., with diminishing per-symbol redundancy. In fact, not 
only do universal codes exist, but under mild restrictions, they achieve the same redundancy as 
obtained for alphabets of size $O(n)$. The restrictions on this subclass imply that some types of fast
decaying monotonic distributions are included in it, and therefore, sequences generated by these 
distributions (without prior knowledge of either the distribution or of its parameters) can still be 
compressed universally in the class of monotonic distributions. 

The main contributions of this paper are the development of codes and derivation of their
upper bounds on the redundancies for coding i.i.d. sequences generated by monotonic distributions.
Specifically, the paper gives complete characterization of the redundancy in coding with monotonic
distributions over "small" alphabets ($k = o(n^{1/3})$) and "large" alphabets ($k = O(n)$). Then, it
shows that these redundancy bounds carry over (in first order) to fast decaying distributions. Next,
a code that achieves good redundancy rates for even slower decaying monotonic distributions is
derived, and is used to study achievable redundancy rates for such distributions. Lower bounds are
also presented to complete the characterization, and are shown to meet the upper bounds in the first
three cases (small alphabets, large alphabets, and fast decaying distributions). The lower bounds
turn out to result from lower bounds obtained for coding patterns. The relationship to patterns is
demonstrated in the proofs of those lower bounds. Finally, individual sequences are considered. It is
shown that under mild conditions, there exist universal codes w.r.t. the monotonic ML description
length for sequences that contain the $O(n)$ more likely symbols, even if their empirical distributions
are not monotonic.

The outline of this paper is as follows. Section 2 describes the notation and basic definitions.
Then, in Section 3, lower bounds on the redundancy for monotonic distributions are derived. Next,
in Section 4, we propose codes and upper bound their redundancy for coding monotonic distributions
over small and large alphabets. These bounds are then extended to fast decaying monotonic
distributions in Section 5. Finally, in Section 6, we consider individual sequence redundancy.



2 Notation and Definitions 

Let $x^n = (x_1, x_2, \ldots, x_n)$ denote a sequence of $n$ symbols over the alphabet $\Sigma$ of size $k$, where $k$
can go to infinity. Without loss of generality, we assume that $\Sigma = \{1, 2, \ldots, k\}$, i.e., it is the set of
positive integers from 1 to $k$. The sequence $x^n$ is generated by an i.i.d. distribution of some source,
determined by the parameter vector $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_k)$, where $\theta_i$ is the probability of $X$ taking
value $i$. The components of $\boldsymbol{\theta}$ are non-negative and sum to 1. The distributions we consider in
this paper are monotonic. Therefore, $\theta_1 \ge \theta_2 \ge \cdots \ge \theta_k$. The class of all monotonic distributions
will be denoted by $\mathcal{M}$. The class of monotonic distributions over an alphabet of size $k$ is denoted
by $\mathcal{M}_k$. It is assumed that prior to coding $x^n$ both encoder and decoder know that $\boldsymbol{\theta} \in \mathcal{M}$ or
$\boldsymbol{\theta} \in \mathcal{M}_k$, and also know the order of the probabilities in $\boldsymbol{\theta}$. In the more restrictive setting, $k$ is
known in advance and it is known that $\boldsymbol{\theta} \in \mathcal{M}_k$. We do not restrict ourselves to this setting. In
general, boldface letters will denote vectors, whose components will be denoted by their indices in
the vector. Capital letters will denote random variables. We will denote an estimator by the hat
sign. In particular, $\hat{\boldsymbol{\theta}}$ will denote the ML estimator of $\boldsymbol{\theta}$, which is obtained from $x^n$.

The probability of $x^n$ generated by $\boldsymbol{\theta}$ is given by $P_{\boldsymbol{\theta}}(x^n) = \Pr(x^n \mid \boldsymbol{\Theta} = \boldsymbol{\theta})$. The average
per-symbol² $n$th-order redundancy obtained by a code that assigns length function $L(\cdot)$ for $\boldsymbol{\theta}$ is

$$ R_n(L, \boldsymbol{\theta}) = \frac{1}{n} E_{\boldsymbol{\theta}} L[X^n] - H_{\boldsymbol{\theta}}[X], \tag{1} $$

where $E_{\boldsymbol{\theta}}$ denotes expectation w.r.t. $\boldsymbol{\theta}$, and $H_{\boldsymbol{\theta}}[X]$ is the (per-symbol) entropy (rate) of the source
($H_{\boldsymbol{\theta}}[X^n]$ is the $n$th-order sequence entropy of $\boldsymbol{\theta}$, and for i.i.d. sources, $H_{\boldsymbol{\theta}}[X^n] = nH_{\boldsymbol{\theta}}[X]$). With
entropy coding techniques, assigning a universal probability $Q(x^n)$ is identical to designing a
universal code for coding $x^n$ where, up to negligible integer length constraints that will be ignored,
the negative logarithm to the base of 2 of the assigned probability is considered as the code length.

² In this paper, redundancy is defined per-symbol (normalized by the sequence length $n$). However, when we refer
to redundancy in overall bits, we address the block redundancy cost for a sequence.
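To make definition (1) concrete, the sketch below evaluates the per-symbol average redundancy by brute-force enumeration for a short block length, taking the code length of $x^n$ to be $-\log Q(x^n)$ as described above. The coding probability used here, a sequential add-one (Laplace) estimator, is only a stand-in chosen for this illustration; it is not a code proposed in this paper. Function names are hypothetical, and letters are indexed from 0 rather than from 1.

```python
from itertools import product
from math import log2


def laplace_probability(seq, k):
    """Sequential add-one probability Q(x^n) over the alphabet {0, ..., k-1}."""
    counts = [0] * k
    q = 1.0
    for t, symbol in enumerate(seq):
        q *= (counts[symbol] + 1) / (t + k)
        counts[symbol] += 1
    return q


def average_redundancy(theta, n):
    """Per-symbol redundancy (1/n) E[-log2 Q(X^n)] - H[X], by enumeration."""
    k = len(theta)
    entropy = -sum(p * log2(p) for p in theta if p > 0)
    expected_length = 0.0
    for seq in product(range(k), repeat=n):
        p = 1.0
        for symbol in seq:
            p *= theta[symbol]
        if p > 0:
            expected_length -= p * log2(laplace_probability(seq, k))
    return expected_length / n - entropy
```

A call such as `average_redundancy((0.6, 0.3, 0.1), 6)` enumerates all $3^6$ sequences. The enumeration grows as $k^n$, so this is only practical for very small $n$ and $k$.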

The individual sequence redundancy (see, e.g., [29]) of a code with length function $L(\cdot)$ per
sequence $x^n$ is

$$ \hat{R}_n(L, x^n) = \frac{1}{n}\left\{L(x^n) + \log P_{ML}(x^n)\right\}, \tag{2} $$

where the logarithm function is taken to the base of 2, here and elsewhere, and $P_{ML}(x^n)$ is the
probability of $x^n$ given by the ML estimator $\hat{\boldsymbol{\theta}}_\Lambda \in \Lambda$ of the governing parameter vector $\boldsymbol{\theta}$. The
negative logarithm of this probability is, up to integer length constraints, the shortest possible
code length assigned to $x^n$ in $\Lambda$. It will be referred to as the ML description length of $x^n$ in $\Lambda$.
In the general case, one considers the i.i.d. ML. However, since we only consider $\boldsymbol{\theta} \in \mathcal{M}$, i.e.,
restrict the sequence to one governed by a monotonic distribution, we define $\hat{\boldsymbol{\theta}}_{\mathcal{M}} \in \mathcal{M}$ as the
monotonic ML estimator. Its associated shortest code length will be referred to as the monotonic
ML description length. The estimator $\hat{\boldsymbol{\theta}}_{\mathcal{M}}$ may differ from the i.i.d. ML $\hat{\boldsymbol{\theta}}$, in particular, if the
empirical distribution of $x^n$ is not monotonic. The individual sequence redundancy in $\mathcal{M}$ is thus
defined w.r.t. the monotonic ML description length, which is the negative logarithm of
$P_{ML}(x^n) = P_{\hat{\boldsymbol{\theta}}_{\mathcal{M}}}(x^n) = \Pr(x^n \mid \boldsymbol{\Theta} = \hat{\boldsymbol{\theta}}_{\mathcal{M}} \in \mathcal{M})$.
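The text does not specify how the monotonic ML estimator is computed. One standard possibility, assumed here and not taken from this paper, is the pool-adjacent-violators procedure from order-restricted inference: adjacent letters whose empirical frequencies violate the order are pooled and share their average frequency. The sketch below uses hypothetical function names and takes the occurrence counts of letters $1, 2, \ldots, k$ as input.

```python
from math import log2


def monotone_ml(counts):
    """Non-increasing ML estimate from occurrence counts of letters 1, ..., k.

    Adjacent letters that violate the order are pooled and share their
    average relative frequency (pool-adjacent-violators).
    """
    n = sum(counts)
    blocks = []  # each block is [total count, number of letters]
    for c in counts:
        blocks.append([c, 1])
        # Merge while the newest block's average exceeds the previous one's.
        while (len(blocks) > 1
               and blocks[-1][0] * blocks[-2][1] > blocks[-2][0] * blocks[-1][1]):
            total, size = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += size
    theta = []
    for total, size in blocks:
        theta.extend([total / (size * n)] * size)
    return theta


def description_length(counts, theta):
    """-log2 of the i.i.d. probability theta assigns to a sequence with these counts."""
    return -sum(c * log2(p) for c, p in zip(counts, theta) if c > 0)
```

With counts `[3, 5, 2]` the empirical distribution is not monotonic, and the first two letters are pooled. The resulting monotonic ML description length is then at least as large as the i.i.d. ML description length, which is why the two notions of individual redundancy can differ.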

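The average minimax redundancy of some class $\Lambda$ is defined as

$$ R_n^{+}(\Lambda) = \min_{L} \sup_{\boldsymbol{\theta} \in \Lambda} R_n(L, \boldsymbol{\theta}). \tag{3} $$

Similarly, the individual minimax redundancy is that of the best code $L(\cdot)$ for the worst sequence
$x^n$,

$$ \hat{R}_n^{+}(\Lambda) = \min_{L} \sup_{\boldsymbol{\theta} \in \Lambda} \max_{x^n} \frac{1}{n}\left\{L(x^n) + \log P_{\boldsymbol{\theta}}(x^n)\right\}. \tag{4} $$

The maximin redundancy of $\Lambda$ is

$$ R_n^{-}(\Lambda) = \sup_{w} \min_{L} \int_{\Lambda} w(d\boldsymbol{\theta})\, R_n(L, \boldsymbol{\theta}), \tag{5} $$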

where $w(\cdot)$ is a prior on $\Lambda$. In [5], it was shown that $R_n^{+}(\Lambda) \ge R_n^{-}(\Lambda)$. Later, however,
the two were shown to be essentially equal [6], [24].

3 Lower Bounds



Lower bounds on various forms of the redundancy for the class of monotonic distributions can be
obtained with slight modifications of the proofs for the lower bounds on the redundancy of coding
patterns in [13], [20], [21], and [26]. The bounds are presented in the following three theorems. For
the sake of completeness, the main steps of the proofs of the first two theorems are presented in
appendices, and the proof of the third theorem is presented below. The reader is referred to
[20], [21], [25] and [26] for more details.



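Theorem 1 Fix an arbitrarily small $\varepsilon > 0$, and let $n \to \infty$. Then, the $n$th-order average
maximin and minimax universal coding redundancies for i.i.d. sequences generated by a monotonic
distribution with alphabet size $k$ are lower bounded by

$$
R_n^{-}(\mathcal{M}_k) \ge
\begin{cases}
\dfrac{k-1}{2n} \log \dfrac{n^{1-\varepsilon}}{k^3} + \dfrac{k-1}{2n} \log \dfrac{\pi e^3}{2} - O\left(\dfrac{\log k}{n}\right), & \text{for } k \le \left(\dfrac{\pi n^{1-\varepsilon}}{2}\right)^{1/3}, \\[2ex]
\left(\dfrac{\pi}{2}\right)^{1/3} (1.5 \log e)\, \dfrac{n^{(1-\varepsilon)/3}}{n} - O\left(\dfrac{\log n}{n}\right), & \text{for } k > \left(\dfrac{\pi n^{1-\varepsilon}}{2}\right)^{1/3}.
\end{cases}
\tag{6}
$$
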

Theorem 2 Fix an arbitrarily small $\varepsilon > 0$, and let $n \to \infty$. Then, the $n$th-order average universal
coding redundancy for coding i.i.d. sequences generated by monotonic distributions with alphabet
size $k$ is lower bounded by

i- e y/3 

(7) 



R n (L,0) > { 



^log^-^log^-0(¥)> fork<i 
^■^-O(^), fork>\ 

for every code $L(\cdot)$ and almost every i.i.d. source $\boldsymbol{\theta} \in \mathcal{M}_k$, except for a set of sources $A_\varepsilon(n)$ whose
relative volume in $\mathcal{M}_k$ goes to 0 as $n \to \infty$.



1/3 



Theorems 1 and 2 give lower bounds on redundancies of coding over monotonic distributions
for the class $\mathcal{M}_k$. However, the bounds are more general, and the second region applies to the
whole class of monotonic distributions $\mathcal{M}$. As in the case of patterns [20], [26], the bounds in
(6)-(7) show that each parameter costs at least $0.5\log(n/k^3)$ bits for small alphabets, and the total
universality cost is at least $O(n^{1/3-\varepsilon})$ bits overall for larger alphabets. Unlike the currently known
results on patterns, however, we show in Section 4 that for $k = O(n)$ these bounds are achievable
for monotonic distributions. The proofs of Theorems 1 and 2 are presented in Appendix A and in
Appendix B, respectively.



Theorem 3 Let $n \to \infty$. Then, the $n$th-order individual minimax redundancy for i.i.d. sequences
with maximal letter $k$ w.r.t. the monotonic ML description length with alphabet size $k$ is lower
bounded by



(f^r-l(loge)-^-O(^), forn>k>^.n^ (8) 
l(loge)-^-Of^), fork>n. 



K (M k ) > 



Theorem 3 lower bounds the individual minimax redundancy for coding a sequence believed
to have an empirical monotonic distribution. The alphabet size is determined by the maximal
letter that occurs in the sequence, i.e., $k = \max\{x_1, x_2, \ldots, x_n\}$. (If $k$ is unknown, one can use
Elias' code for the integers [7] using $O(\log k)$ bits to describe $k$. However, this is not reflected in
the lower bound.) The ML probability estimate is taken over the class of monotonic distributions,
i.e., the empirical probability (standard ML) estimate $\hat{\boldsymbol{\theta}}$ is not $\hat{\boldsymbol{\theta}}_{\mathcal{M}}$ in case $\hat{\boldsymbol{\theta}}$ does not satisfy the
monotonicity that defines the class $\mathcal{M}$. While the average case maximin and minimax bounds
of Theorem 1 also apply to $\hat{R}_n^{+}(\mathcal{M}_k)$, the bounds of Theorem 3 are tighter for the individual
redundancy and are obtained using individual sequence redundancy techniques.

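Proof of Theorem 3: Using Shtarkov's normalized maximum likelihood (NML) approach [29],
one can assign probability

$$ Q(x^n) \triangleq \frac{P_{\hat{\boldsymbol{\theta}}_{\mathcal{M}}}(x^n)}{\sum_{y^n} P_{\hat{\boldsymbol{\theta}}_{\mathcal{M}}}(y^n)} = \frac{\max_{\boldsymbol{\theta}' \in \mathcal{M}} P_{\boldsymbol{\theta}'}(x^n)}{\sum_{y^n} \max_{\boldsymbol{\theta}' \in \mathcal{M}} P_{\boldsymbol{\theta}'}(y^n)} \tag{9} $$

to sequence $x^n$. This approach minimizes the individual minimax redundancy, giving individual
redundancy of

$$ \hat{R}_n(Q, x^n) = \frac{1}{n} \log \frac{P_{\hat{\boldsymbol{\theta}}_{\mathcal{M}}}(x^n)}{Q(x^n)} = \frac{1}{n} \log \left\{ \sum_{y^n} \max_{\boldsymbol{\theta}' \in \mathcal{M}} P_{\boldsymbol{\theta}'}(y^n) \right\} \tag{10} $$

to every $x^n$, specifically achieving the individual minimax redundancy.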

It is now left to bound the logarithm of the sum in (10). For the first two regions, we follow
the approach used in Theorem 2 in [21] for bounding the redundancy for standard compression
of i.i.d. sequences over large alphabets, but adjust it to monotonic distributions. Alternatively,
one can derive the same bounds following the approach used for bounding the individual minimax
redundancy of patterns in proving Theorem 12 in [20]. Let $\mathbf{n}_x^{\ell} = (n_x(1), n_x(2), \ldots, n_x(\ell))$ denote
the occurrence counts of the first $\ell$ letters of the alphabet $\Sigma$ in $x^n$. For $\ell = k$, $\sum_{i=1}^{k} n_x(i) = n$.
Now, following (10),




1 71 v 

log to + fclog— =■ -O(log^) (11) 



2 6 fc 3 b ^ 

where (a) follows from including only sequences $y^n$ that have a monotonic empirical (i.i.d. ML)
distribution in Shtarkov's sum. Inequality (b) follows from partitioning the sequences $y^n$ into types
as done in [21], first by the number of occurring symbols $\ell$, and then by the empirical distribution.
Unlike standard i.i.d. distributions though, monotonicity implies that only the first $\ell$ symbols in $\Sigma$
occur, and thus the choice of $\ell$ out of $k$ in the proof in [21] is replaced by 1. Like in coding patterns,
we also divide by $\ell!$ because each type with $\ell$ occurring symbols can be ordered in at most $\ell!$ ways,
where only some retain the monotonicity. (Note that this step is the reason that step (b) produces
an inequality, because more than one of the orderings may be monotonic if equal occurrence counts
occur.) Except the division by $\ell!$, the remaining steps follow those in [21]. Retaining only the term
$\ell = k$ yields inequality (c). Inequality (d) follows from Stirling's bound

$$ \sqrt{2\pi m} \cdot \left(\frac{m}{e}\right)^m \le m! \le \sqrt{2\pi m} \cdot \left(\frac{m}{e}\right)^m \cdot \exp\left\{\frac{1}{12m}\right\}. \tag{12} $$

Then, (e) follows from the relation between arithmetic and geometric means, and from expressing
the number of types as the number of ordered partitions of $n$ into $k$ parts, $\binom{n-1}{k-1}$. Finally, (f)
follows from applying (12) again and by lower bounding $\binom{n-1}{k-1}$.
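Step (a) can be checked numerically for small $n$ and $k$. For a sequence whose empirical distribution is already monotonic, the monotonic ML probability coincides with the i.i.d. ML probability $\prod_i (n_i/n)^{n_i}$, so the restricted Shtarkov sum can be evaluated by enumerating non-increasing count vectors and weighting each by the number of sequences with those counts. The following brute-force sketch, with hypothetical function names, computes this restricted sum; its base-2 logarithm divided by $n$ is then a lower bound on (10). It does not reproduce the subsequent analytic steps (b)-(f).

```python
from math import comb


def monotone_count_vectors(n, k, cap=None):
    """Yield count vectors (n_1 >= n_2 >= ... >= n_k >= 0) that sum to n."""
    if cap is None:
        cap = n
    if k == 1:
        if n <= cap:
            yield (n,)
        return
    for first in range(min(n, cap), -1, -1):
        if first * k < n:  # remaining parts cannot exceed `first`
            break
        for rest in monotone_count_vectors(n - first, k - 1, first):
            yield (first,) + rest


def multinomial(counts):
    """Number of sequences with the given occurrence counts."""
    total, result = 0, 1
    for c in counts:
        total += c
        result *= comb(total, c)
    return result


def ml_probability(counts):
    """i.i.d. ML probability of one sequence with the given counts."""
    n = sum(counts)
    p = 1.0
    for c in counts:
        if c:
            p *= (c / n) ** c
    return p


def restricted_shtarkov_sum(n, k):
    """Shtarkov's sum restricted to sequences with monotonic empirical counts."""
    return sum(multinomial(c) * ml_probability(c)
               for c in monotone_count_vectors(n, k))
```

For example, `log2(restricted_shtarkov_sum(n, k)) / n` (with `log2` from `math`) gives a numerical per-symbol lower bound for small cases; the enumeration is over partitions of $n$ into at most $k$ parts and is intended only for illustration.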

The first region in (8) results directly from (11). The behavior is similar to patterns as shown
in [1] for this region. As mentioned in [20], to obtain the second region, the bound is maximized
by retaining $\ell = (n^{1/3} e^{5/18})/(2\pi)^{1/3}$ instead of $k$ in step (c) of (11), for every $k \ge \ell$. The bounds
obtained are equal to those obtained for patterns because the first step (a) in (11) discards all
the sequences whose contributions to Shtarkov's sum are different between patterns and monotonic
distributions. A similar step is effectively done deriving the bounds for patterns. The difference
is that in the case of patterns, components of Shtarkov's sum are reduced, but all are retained
in the sum, while here, we omit components from the sum, corresponding to sequences with
non-monotonic i.i.d. ML estimates. The analysis in [20] that also attains the second region of the
bound in (8) is still valid here. It differs from the steps taken above by lower bounding a pattern
probability by a larger probability than the ML i.i.d. probability corresponding to the pattern. The
bound used in the derivation of Theorem 12 in [20] adds a multiplicative factor to each pattern
probability which equals the number of sequences with the same pattern and an equal i.i.d. ML
probability. However, this similar effect is included in Shtarkov's sum for monotonic distributions
since all these sequences do have a corresponding i.i.d. ML estimate which is monotonic, and are
thus not omitted by step (a) of the derivation.

The analysis in [14] yields the third region of the bound in (8), since, for $k \ge n$,

(13)

where $\Psi(y^n)$ is the pattern of the sequence $y^n$. Inequality (a) holds because each pattern
corresponds to at least one sequence whose ML probability parameter estimates are ordered, i.e.,
$\hat{\theta}_i \ge \hat{\theta}_{i+1}, \forall i$, where the most probable index represents $i = 1$, the second most probable index
$i = 2$, and so on. Note that the sum element on the right hand side is for a probability of a
sequence, not a pattern, but the sum is over all patterns. The left hand side also includes sequences
for which the probabilities are unordered. Furthermore, exchanging the letters that correspond
to two indices with the same occurrence count will not violate monotonicity. Thus the inequality
follows. Step (b) in (13) is taken from [13], where the sum on the left hand side was shown to
equal the right hand side. This was true when summing over all patterns with up to $n$ indices, thus
requiring $k \ge n$. Note that this requirement does not mean that $n$ distinct symbols must occur
in $x^n$, only that the maximal symbol in $x^n$ is $n$ or greater. This concludes the proof of Theorem 3. □



4 Upper Bounds for Small and Large Alphabets 



In this section, we demonstrate codes that asymptotically achieve the lower bounds for $\boldsymbol{\theta} \in \mathcal{M}_k$
and $k = O(n)$. We begin with a theorem that shows the achievable redundancies, and devote the
remainder of the section to describing the codes and deriving upper bounds on their redundancies.
The theorem is stated assuming no initial knowledge of $k$. The proof first considers the setting
where $k$ is known, and then shows how the same bounds are achieved even when $k$ is unknown in
advance, as long as it satisfies the conditions.

Theorem 4 Fix an arbitrarily small $\varepsilon > 0$, and let $n \to \infty$. Then, there exists a code with length
function $L^*(\cdot)$ that achieves redundancy

(14)

for i.i.d. sequences generated by any source $\boldsymbol{\theta} \in \mathcal{M}_k$.

Slightly tighter bounds are possible in the first and second regions and between them. The
bounds presented, however, are inclusive for each of the regions. Note that the third region
contains the second, but if $k = o(n)$, a tighter bound is possible in the second region. The code
designed to code a sequence $x^n$ is a two part code [23] that quantizes a distribution that minimizes
the cost, and uses it to code $x^n$. The total redundancy cost consists of the cost of describing the
quantized distribution and the quantization cost. The second is bounded through the quantized
true distribution of the sequence, which cannot result in lower cost than that of the chosen
distribution (which minimizes the cost). In order to achieve the low costs of the lower bound, the
probability parameters are quantized non-uniformly, where the smaller the probability the finer the
quantization. This approach was used in [25] and [26] to obtain upper bounds on the redundancy
for coding over large alphabets and for coding patterns, respectively. The method used in [25] and
[26], however, is insufficient here, because it still results in too many quantization points due to the
polynomial growth in quantization spacing. Here, we use an exponential growth as the parameters
increase. This general idea was used in [28] to improve an upper bound on the redundancy of
coding patterns. Here, however, we improve on the method presented in [28]. Another key step in
the proof here is the fact that since both encoder and decoder know the order of the probabilities
a-priori, this order need not be coded. It is sufficient to encode the quantized probabilities of the
monotonic distribution, and the decoder can identify which probability is associated with which
symbol using the monotonicity of the distribution.

Proof of Theorem 4: We start with $k \le n^{1/3}$ assuming $k$ is known. Let $\beta = 1/(\log n)$ be a
parameter (note that we can choose other values). Partition the probability space into $J_1 = \lceil 1/\beta \rceil$
intervals,

$$ I_j = \left[ \frac{n^{(j-1)\beta}}{n}, \frac{n^{j\beta}}{n} \right), \quad 1 \le j \le J_1. \tag{15} $$

Note that $I_1 = [1/n, 2/n)$, $I_2 = [2/n, 4/n)$, ..., $I_j = [2^{j-1}/n, 2^j/n)$. Let $k_j = |\theta_i \in I_j|$ denote the
number of probabilities in $\boldsymbol{\theta}$ that are in interval $I_j$. In interval $j$, take a grid of points with spacing

$$ \Delta_j^{(1)} = \frac{\sqrt{k}\, n^{j\beta}}{n^{1.5}}. \tag{16} $$

Note that to complete all points in an interval, the spacing between two points at the boundary of an
interval may be smaller. There are $\lceil \log n \rceil$ intervals. Ignoring negligible integer length constraints
(here and elsewhere), in each interval, the number of points is bounded by

$$ |I_j| \le \frac{1}{2} \cdot \sqrt{\frac{n}{k}}, \quad \forall j : j = 1, 2, \ldots, J_1, \tag{17} $$

where $|\cdot|$ denotes the cardinality of a set. Let the grid

$$ \boldsymbol{\tau} = (\tau_1, \tau_2, \ldots) = \left( \frac{1}{n}, \frac{1}{n} + \frac{2\sqrt{k}}{n^{1.5}}, \ldots, \frac{2}{n}, \frac{2}{n} + \frac{4\sqrt{k}}{n^{1.5}}, \ldots \right) \tag{18} $$

be a vector that takes all the points from all intervals, with cardinality

$$ B_1 \triangleq |\boldsymbol{\tau}| \le \frac{1}{2} \cdot \sqrt{\frac{n}{k}}\, \lceil \log n \rceil. \tag{19} $$
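The grid construction above can be sketched in a few lines of Python. The sketch assumes base-2 logarithms, so that $n^\beta = 2$ and interval $j$ is $[2^{j-1}/n, 2^j/n)$, and uses the spacing $\sqrt{k}\,2^j/n^{1.5}$ within interval $j$. The function name `build_grid` is hypothetical. Because the number of points per interval is rounded up here, the total count can exceed the idealized bound (19) by at most one point per interval.

```python
import math


def build_grid(n, k):
    """Quantization grid: interval j is [2^(j-1)/n, 2^j/n), with
    spacing sqrt(k) * 2^j / n^1.5 inside interval j."""
    num_intervals = math.ceil(math.log2(n))  # J_1 = ceil(1/beta), beta = 1/log2(n)
    grid = []
    for j in range(1, num_intervals + 1):
        low, high = 2 ** (j - 1) / n, 2 ** j / n
        spacing = math.sqrt(k) * 2 ** j / n ** 1.5
        m = 0
        while low + m * spacing < high:
            grid.append(low + m * spacing)
            m += 1
    return grid
```

With `n = 1000` and `k = 5`, for example, each of the 10 intervals receives the same number of points, roughly $0.5\sqrt{n/k}$, although the intervals double in length. This is the effect of letting the spacing grow exponentially with $j$.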

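Now, let $\boldsymbol{\varphi} = (\varphi_1, \varphi_2, \ldots, \varphi_k)$ be a monotonic probability vector, such that $\sum \varphi_i = 1$,
$\varphi_1 \ge \varphi_2 \ge \cdots \ge \varphi_k \ge 0$, and also the smaller $k - 1$ components of $\boldsymbol{\varphi}$ are either 0 or from $\boldsymbol{\tau}$, i.e.,
$\varphi_i \in (\boldsymbol{\tau} \cup \{0\})$, $i = 2, 3, \ldots, k$. One can code $x^n$ using a two part code, assuming the distribution
governing $x^n$ is given by the parameter $\boldsymbol{\varphi}$. The code length required (up to integer length constraints) is

$$ L(x^n \mid \boldsymbol{\varphi}) = \log k + L_R(\boldsymbol{\varphi}) - \log P_{\boldsymbol{\varphi}}(x^n), \tag{20} $$

where $\log k$ bits are needed to describe how many letter probabilities are greater than 0 in $\boldsymbol{\varphi}$, and
$L_R(\boldsymbol{\varphi})$ is the number of bits required to describe the quantized points of $\boldsymbol{\varphi}$.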

The vector $\boldsymbol{\varphi}$ can be described by a code as follows. Let $k_\varphi$ be the number of nonzero letter
probabilities hypothesized by $\boldsymbol{\varphi}$. Let $b_i$ denote the index of $\varphi_i$ in $\boldsymbol{\tau}$, i.e., $\varphi_i = \tau_{b_i}$. Then, we
will use the following differential code. For $\varphi_{k_\varphi}$ we need at most $1 + \log b_{k_\varphi} + 2\log(1 + \log b_{k_\varphi})$
bits to code its index in $\boldsymbol{\tau}$ using Elias' coding for the integers [7]. For $\varphi_{i-1}$, we need at most
$1 + \log(b_{i-1} - b_i + 1) + 2\log[1 + \log(b_{i-1} - b_i + 1)]$ bits to code the index displacement from the
index of the previous parameter, where an additional 1 is added to the difference in case the two
parameters share the same index. Summing up all components of $\boldsymbol{\varphi}$, and taking $b_{k_\varphi + 1} = 0$,

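$$
\begin{aligned}
L_R(\boldsymbol{\varphi}) &\le k_\varphi - 1 + \sum_{i=2}^{k_\varphi} \log(b_i - b_{i+1} + 1) + 2\sum_{i=2}^{k_\varphi} \log\left[1 + \log(b_i - b_{i+1} + 1)\right] \\
&\overset{(a)}{\le} (k - 1) + (k - 1)\log\frac{B_1 + k - 1}{k - 1} + 2(k - 1)\log\log\frac{B_1 + k - 1}{k - 1} + o(k) \\
&\overset{(b)}{\le} (1 + \varepsilon)\,\frac{k - 1}{2}\,\log\frac{n(\log n)^2}{k^3}.
\end{aligned}
\tag{21}
$$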

Inequality (a) is obtained by applying Jensen's inequality once on the first sum, twice on the second
sum utilizing the monotonicity of the logarithm function, and by bounding $k_\varphi$ by $k$ and absorbing
low order terms in the resulting $o(k)$ term. Then, low order terms are absorbed in $\varepsilon$, and (19) is
used to obtain (b).
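The differential description of the grid indices can be sketched as follows. The sketch assumes the Elias delta code as the particular Elias code for the integers, which satisfies the length bound $1 + \log m + 2\log(1 + \log m)$ used above; all function names are hypothetical. The input is the non-increasing list of indices $b_2 \ge b_3 \ge \cdots \ge b_{k_\varphi}$, and each transmitted integer is $b_i - b_{i+1} + 1$ with $b_{k_\varphi + 1} = 0$, as in the summed form of (21).

```python
from math import log2


def elias_delta_encode(m):
    """Elias delta codeword (as a bit string) of an integer m >= 1."""
    assert m >= 1
    n_bits = m.bit_length()         # floor(log2 m) + 1
    len_bits = n_bits.bit_length()  # bits needed to write n_bits
    return "0" * (len_bits - 1) + format(n_bits, "b") + format(m, "b")[1:]


def elias_delta_decode_stream(bits):
    """Decode a concatenation of Elias delta codewords."""
    pos, values = 0, []
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":
            zeros += 1
            pos += 1
        n_bits = int(bits[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        values.append(int("1" + bits[pos:pos + n_bits - 1], 2))
        pos += n_bits - 1
    return values


def elias_length_bound(m):
    """The bound 1 + log m + 2 log(1 + log m) on the codeword length."""
    return 1 + log2(m) + 2 * log2(1 + log2(m))


def encode_grid_indices(b):
    """Differentially encode non-increasing indices b_2 >= ... >= b_{k_phi}."""
    extended = list(b) + [0]  # b_{k_phi + 1} = 0
    return "".join(elias_delta_encode(extended[i] - extended[i + 1] + 1)
                   for i in range(len(b)))


def decode_grid_indices(bits):
    """Invert encode_grid_indices."""
    indices, nxt = [], 0
    for d in reversed(elias_delta_decode_stream(bits)):
        nxt += d - 1
        indices.append(nxt)
    return indices[::-1]
```

Adding 1 to every difference keeps each transmitted integer at least 1 even when two parameters share a grid point, which is what allows a code for the positive integers to be used directly.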

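To code $x^n$, we choose $\boldsymbol{\varphi}$ which minimizes the expression in (20) over all $\boldsymbol{\varphi}$, i.e.,

$$ L^*(x^n) = \min_{\boldsymbol{\varphi}} L(x^n \mid \boldsymbol{\varphi}) \triangleq L(x^n \mid \hat{\boldsymbol{\varphi}}). \tag{22} $$

The pointwise redundancy for $x^n$ is given by

$$ nR_n(L^*, x^n) = L^*(x^n) + \log P_{\boldsymbol{\theta}}(x^n) = \log k + L_R^*(\hat{\boldsymbol{\varphi}}) + \log\frac{P_{\boldsymbol{\theta}}(x^n)}{P_{\hat{\boldsymbol{\varphi}}}(x^n)}. \tag{23} $$

Note that the pointwise redundancy differs from the individual one, since it is defined w.r.t. the
true probability of $x^n$.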

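To bound the third term of (23), let $\boldsymbol{\theta}'$ be a quantized still monotonic version of $\boldsymbol{\theta}$ onto $\boldsymbol{\tau}$, i.e.,
$\theta'_i \in (\boldsymbol{\tau} \cup \{0\})$, $i = 2, 3, \ldots, k$, where if $\theta_i > 0$, then $\theta'_i > 0$ as well. Define the quantization error,

$$ \delta_i = \theta_i - \theta'_i. \tag{24} $$

The quantization is performed from the smallest parameter $\theta_k$ to the largest, where monotonicity
is retained, as well as minimal absolute quantization error. This implies that $\theta_i$ will be quantized
to one of the two nearest grid points (one smaller and one greater than it). It also guarantees that
$|\delta_i| \le \Delta_{j_i}^{(1)}$, where $j_i$ is the index of the interval in which $\theta_i$ is contained, i.e., $\theta_i \in I_{j_i}$. Now, since
$\boldsymbol{\theta}'$ is included in the minimization of (22), we have, for every $x^n$,

$$ L^*(x^n) \le L(x^n \mid \boldsymbol{\theta}'), \tag{25} $$

and also

$$ nR_n(L^*, x^n) \le \log k + L_R(\boldsymbol{\theta}') + \log\frac{P_{\boldsymbol{\theta}}(x^n)}{P_{\boldsymbol{\theta}'}(x^n)}. \tag{26} $$

Averaging over all possible $x^n$, the average redundancy is bounded by

$$
\begin{aligned}
nR_n(L^*, \boldsymbol{\theta}) &= \log k + E_{\boldsymbol{\theta}} L_R^*(\hat{\boldsymbol{\varphi}}) + E_{\boldsymbol{\theta}} \log\frac{P_{\boldsymbol{\theta}}(X^n)}{P_{\hat{\boldsymbol{\varphi}}}(X^n)} \\
&\le \log k + E_{\boldsymbol{\theta}} L_R(\boldsymbol{\theta}') + E_{\boldsymbol{\theta}} \log\frac{P_{\boldsymbol{\theta}}(X^n)}{P_{\boldsymbol{\theta}'}(X^n)}.
\end{aligned}
\tag{27}
$$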
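The second term of (27) is bounded with the bound of (21), and we proceed with the third term.

$$
\begin{aligned}
E_{\boldsymbol{\theta}} \log\frac{P_{\boldsymbol{\theta}}(X^n)}{P_{\boldsymbol{\theta}'}(X^n)}
&\overset{(a)}{=} n \sum_{i=1}^{k} \theta_i \log\frac{\theta_i}{\theta'_i}
\overset{(b)}{=} n \sum_{i=1}^{k} (\theta'_i + \delta_i) \log\left(1 + \frac{\delta_i}{\theta'_i}\right) \\
&\overset{(c)}{\le} n(\log e) \sum_{i=1}^{k} \delta_i + n(\log e) \sum_{i=1}^{k} \frac{\delta_i^2}{\theta'_i}
\overset{(d)}{=} n(\log e) \sum_{i=1}^{k} \frac{\delta_i^2}{\theta'_i} \\
&\overset{(e)}{\le} k \log e + 2(\log e)\, k \sum_{j=1}^{J_1} \frac{k_j n^{j\beta}}{n}
\overset{(f)}{\le} 5(\log e)\, k.
\end{aligned}
\tag{28}
$$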

Equality (a) holds since the argument in the logarithm is fixed, thus expectation is performed only on
the number of occurrences of letter $i$ for each letter. Representing $\theta_i = \theta'_i + \delta_i$ yields equation (b).
We use $\ln(1 + x) \le x$ to obtain (c). Equality (d) is obtained since all the quantization displacements
must sum to 0. The first term of inequality (e) is obtained under a worst case assumption that
$\theta_i \ll 1/n$ for $i \ge 2$. Thus it is quantized to $\theta'_i = 1/n$, and the bound $|\delta_i| \le 1/n$ is used. The
second term is obtained by separating the terms into their intervals. In interval $j$, the bounds
$\theta'_i \ge n^{(j-1)\beta}/n$ and $|\delta_i| \le \sqrt{k}\, n^{j\beta}/n^{1.5}$ are used, and also $n^\beta = 2$. Inequality (f) is obtained since

$$ \sum_{j=1}^{J_1} k_j n^{j\beta} = \sum_{j=1}^{J_1} k_j 2^j \le 2n. \tag{29} $$

Inequality (I29D is obtained since k\ < n, k<i < (n — k\)/2, k% < (n — k\)/& — &2/2, and so on, until 

The reason for these relations are the lower limits of the J\ intervals that restrict the number 
of parameters inside the interval. The restriction is done in order of intervals, so that the used 
probabilities are subtracted, leading to the series of equations. 
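The bound in (29) can be checked numerically. The following is a minimal sketch, assuming the intervals are $I_j = [2^{j-1}/n, 2^j/n)$ (the lower limits $n^{(j-1)\beta}/n$ with $n^\beta = 2$ quoted above) and that probabilities below $1/n$ are simply skipped; the function names are hypothetical and not part of the original construction.

```python
import math

def interval_counts(theta, n):
    """Count probabilities per interval I_j = [2^(j-1)/n, 2^j/n), j >= 1.

    Probabilities below 1/n are skipped (an assumption of this sketch).
    """
    counts = {}
    for p in theta:
        if p < 1.0 / n:
            continue
        j = math.floor(math.log2(p * n)) + 1
        counts[j] = counts.get(j, 0) + 1
    return counts

def weighted_sum(theta, n):
    """Left-hand side of (29): sum over j of k_j * 2^j."""
    return sum(k * 2 ** j for j, k in interval_counts(theta, n).items())

# Geometric-like probabilities 1/2, 1/4, ... with n = 1000.
theta = [0.5 ** i for i in range(1, 30)]
print(weighted_sum(theta, 1000), "<=", 2 * 1000)
```

Each probability in $I_j$ is at least $2^{j-1}/n$, so its weight $2^j$ is at most $2n$ times the probability, and the weights sum to at most $2n$ because the probabilities sum to at most 1.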

Plugging the bounds of (21) and (28) into (27), we obtain

$$nR_n(L^*,\theta) \le \log k + (1+\varepsilon)\frac{k-1}{2}\log\frac{n(\log n)^2}{k^3} + 5(\log e)k \le (1+\varepsilon')\frac{k-1}{2}\log\frac{n(\log n)^2}{k^3}, \tag{31}$$

where we absorb low order terms in $\varepsilon'$. Replacing $\varepsilon'$ by $\varepsilon$ and normalizing the redundancy per symbol
by $n$, the bound of the first region of (14) is proved.

We now consider the larger values of $k$, i.e., $n^{1/3} < k = O(n)$. The idea of the proof is the
same. However, we need to partition the probability space to different intervals, the spacing within
an interval must be optimized, and the parameters' description cost must be bounded differently,
because now there are more parameters quantized than points in the quantization grid. Define the
$j$th interval as

$$I_j = \left[\frac{n^{(j-1)\beta}}{n^2}, \frac{n^{j\beta}}{n^2}\right), \quad 1 \le j \le J_2, \tag{32}$$

where $J_2 = \lceil 2/\beta \rceil = \lceil 2\log n \rceil$. Again, let $k_j = |\theta_i \in I_j|$ denote the number of probabilities in $\theta$ that
are in interval $I_j$. It could be possible to use the intervals as defined in (15), but this would not
guarantee bounded redundancy in the rate we require if there are very small probabilities $\theta_i \ll 1/n$.
Therefore, the interval definition in (15) can be used for larger alphabets only if the probabilities
of the symbols are known to be bounded. Define the spacing in interval $j$ as

$$\Delta_j^{(2)} = \frac{n^{j\beta}}{n^{2+\alpha}}, \tag{33}$$

where $\alpha$ is a parameter to be optimized. Similarly to (17), the interval cardinality here is

$$|I_j| \le 0.5\cdot n^{\alpha}, \quad \forall j: j = 1, 2, \ldots, J_2. \tag{34}$$

In a similar manner to the definition of $\tau$ in (18), we define

$$\eta = (\eta_1, \eta_2, \ldots) = \left(\frac{1}{n^2}, \frac{1}{n^2}+\frac{2}{n^{2+\alpha}}, \ldots, \frac{2}{n^2}, \frac{2}{n^2}+\frac{4}{n^{2+\alpha}}, \ldots\right). \tag{35}$$

The cardinality of $\eta$ is

$$B_2 = |\eta| \le 0.5\cdot n^{\alpha}\lceil 2\log n\rceil \le n^{\alpha}\lceil \log n\rceil. \tag{36}$$
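As an illustration of (32)-(36), the grid $\eta$ can be enumerated explicitly. The sketch below is not taken from the paper: it assumes $n = 2^b$ and $n^\alpha = 2^a$ with integer $a \ge 1$ (so $\alpha = a/b$) in order to keep all grid points exact rationals, and the helper name `build_eta` is hypothetical.

```python
from fractions import Fraction

def build_eta(b, a):
    """Grid eta of (35) for n = 2**b and n**alpha = 2**a, in exact arithmetic.

    Interval j spans [2^(j-1)/n^2, 2^j/n^2) and uses spacing 2^j/n^(2+alpha),
    following (32) and (33) with n^beta = 2. Requires a >= 1.
    """
    n = 2 ** b
    n_alpha = 2 ** a
    num_intervals = 2 * b          # J_2 = ceil(2 log n) = 2b when n = 2**b
    grid = []
    for j in range(1, num_intervals + 1):
        lower = Fraction(2 ** (j - 1), n ** 2)
        upper = Fraction(2 ** j, n ** 2)
        step = Fraction(2 ** j, n ** 2 * n_alpha)
        point = lower
        while point < upper:
            grid.append(point)
            point += step
    return grid

eta = build_eta(6, 2)              # n = 64, alpha = 1/3
print(len(eta), eta[:3])
```

With these choices every interval holds exactly $0.5\,n^\alpha$ points, so the total count meets the bound $0.5\,n^\alpha\lceil 2\log n\rceil$ of (36) with equality.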



We now perform the encoding similarly to the small $k$ case, where we allow quantization to
nonzero values of the components of $\varphi$ up to $i = n^2$. (This is more than needed but is possible
since $\eta_1 = 1/n^2$.) Thus, similarly to (27), we have

$$nR_n(L^*,\theta) \le 2\log n + E_\theta L_R(\theta') + E_\theta\log\frac{P_\theta(X^n)}{P_{\theta'}(X^n)}, \tag{37}$$

where the first term is due to allowing up to $k = n^2$. Since usually in this region $k > B_2$ (except
the low end), the description of vectors $\varphi$ and $\theta'$ is done by coding the cardinality of $|\varphi_i = \eta_j|$ and
$|\theta'_i = \eta_j|$, respectively, i.e., for each grid point the code describes how many letters have probability
quantized to this point. This idea resembles coding profiles of patterns, as done in [20]. However,
unlike the method in [20], here, many probability parameters of symbols with different occurrences
are mapped to the same grid point by quantization. The number of parameters mapped to a grid
point of $\eta$ is coded using Elias' representation of the integers. Hence, in a similar manner to (21),

$$\begin{aligned}
L_R(\theta') &\overset{(a)}{\le} \sum_{j=1}^{B_2}\left\{1 + \log\left(|\theta'_i = \eta_j| + 1\right) + 2\log\left[1 + \log\left(|\theta'_i = \eta_j| + 1\right)\right]\right\}\\
&\overset{(b)}{\le} B_2 + B_2\log\frac{k+B_2}{B_2} + 2B_2\log\log\frac{k+B_2}{B_2} + o(B_2)\\
&\overset{(c)}{\le} \begin{cases}
(1+\varepsilon)(\log n)\left(\log\frac{k}{n^{\alpha-\varepsilon}}\right)n^{\alpha}, & \text{for } n^{\alpha} < k = o(n),\\
(1+\varepsilon)(1-\alpha)(\log n)^2 n^{\alpha}, & \text{for } n^{\alpha} < k = O(n).
\end{cases}
\end{aligned} \tag{38}$$

The additional 1 term in the logarithm in (a) is for 0 occurrences, (b) is obtained similarly to step
(a) of (21), absorbing all low order terms in the last term. To obtain (c), we first assume, for the
first region, that $kn^{\varepsilon} \gg B_2$ (an assumption that must be later validated with the choice of $\alpha$).
Then, low order terms are absorbed in $\varepsilon$. The extra $n^{\varepsilon}$ factor is unnecessary if $k \gg B_2$. The second
region is obtained by upper bounding $k$ without this factor. It is possible to separate the first
region into two regions, eliminate this factor in the lower region, and obtain a more complicated,
yet tighter, expression in the upper region, where $k \sim \Theta(n^{1/3})$.

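Now, similarly to (28), we obtain

$$E_\theta\log\frac{P_\theta(X^n)}{P_{\theta'}(X^n)} \le n(\log e)\sum_{i=1}^{k}\frac{\delta_i^2}{\theta'_i} \overset{(a)}{\le} O(1) + \frac{2\log e}{n^{1+2\alpha}}\sum_{j=1}^{J_2}k_j n^{j\beta} \overset{(b)}{\le} 4(\log e)n^{1-2\alpha} + O(1). \tag{39}$$

The first term of inequality (a) is obtained under the assumption that $k = O(n)$, $\theta'_i \ge 1/n^2$, and
$|\delta_i| \le 1/n^2$. For the second term, $|\delta_i| \le n^{j\beta}/n^{2+\alpha}$ and $\theta'_i \ge n^{(j-1)\beta}/n^2$. Inequality (b) is obtained
in a similar manner to inequality (f) of (28), where the sum is shown similarly to be at most $2n^2$.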

Summing up the contributions of (38) and (39) in (37), it is clear that $\alpha = 1/3$ minimizes
the total cost (to first order). This choice of $\alpha$ also satisfies the assumption of step (c) in (38).
Using $\alpha = 1/3$, absorbing all low order terms in $\varepsilon$ and normalizing by $n$, we obtain the remaining
two regions of the bound in (14). It should be noted that the proof here would give a bound of
$O(n^{1/3+\varepsilon})$ up to $k = O(n^{4/3})$. If the intervals in (15) were used for bounded distributions, the
coefficients of the last two regions would be reduced by a factor of 2. Additional manipulations on
the grid $\eta$ may reduce the coefficients more (see, e.g., [28]).

The proof up to this point assumes that $k$ is known in advance. This is important for the code
resulting in the bound for the first region because the quantization grid depends on $k$. Specifically,
if in building the grid, $k$ is underestimated, the description cost of $\varphi$ increases. If $k$ is overestimated,
the quantization cost will increase. Also, if the code of the second region is used for a smaller $k$, a
larger bound than necessary results. To solve this, the optimization that chooses $L^*(x^n)$ is done
over all possible values of $k$ (greater than or equal to the maximal symbol occurring in $x^n$), i.e.,
every greater $k$ in the first region, and the construction of the code for the other regions. For every
$k$ in the first region, a different construction is done, using the appropriate $k$ to determine the
spacing in each interval. The value of $k$ yielding the shortest code word is then used, and $O(\log n)$
additional bits are used at the prefix of the code to inform the decoder which $k$ is used. The analysis
continues as before. This does not change the redundancy to first order, giving all three regions of
the bound in (14), even if $k$ is unknown in advance. This concludes the proof of Theorem 4. □



5 Upper Bounds for Fast Decaying Distributions 

This section shows that with some mild conditions on the source distribution, the same redundancy 
upper bounds achieved for finite monotonic distributions can be achieved even if the monotonic 
distribution is over an infinite alphabet. The key observation that allows this is that a distribution 
that decays fast enough will result in only a small number of occurrences of unlikely letters in a 
sequence. These letters may very likely be out of order, but since there are very few of them, they 
can be handled without increasing the asymptotic behavior of the coding cost. More precisely, fast 
decaying monotonic distributions can be viewed as if they have some effective bounded alphabet 
size, where occurrences of symbols outside this limited alphabet are rare. We present two theorems 
and a corollary that show how one can upper bound the redundancy obtained when coding with 
some unknown distribution. The first theorem provides a slightly stronger bound (with smaller 
coefficient) even for k = O(n), where the smaller coefficient is attained by improved bounding, 
that more uniformly weights the quantization cost for minimal probabilities. In the weaker version 
of the results presented here, if the distribution decays slower and there are more low probability 
symbols, the redundancy order does increase due to the penalty of identifying these symbols in 
a sequence. However, we show, consistently with the results in [10], that as long as the entropy
of the source is finite, a universal code, in the sense of diminishing redundancy per symbol, does
exist. We begin with stating the two theorems and the corollary, then the proofs are presented.
The section is concluded with three examples of typical monotonic distributions over the integers,
to which the bounds are applied.



5.1 Upper Bounds 

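We begin with some notation. Fix an arbitrarily small $\varepsilon > 0$, and let $n \to \infty$. Define $m = m_\rho = n^{\rho}$
as the effective alphabet size, where $\rho > \varepsilon$. (Note that $\rho = (\log m)/(\log n)$.) Let

$$\mathcal{R}_n(m) \triangleq \begin{cases}
\frac{m-1}{2}\log\frac{n}{m^3}, & \text{for } m = o(n^{1/3}),\\
\frac{1}{2}\left(\rho+\frac{2}{3}\right)\left(\rho+\varepsilon-\frac{1}{3}\right)(\log n)^2 n^{1/3}, & \text{otherwise.}
\end{cases} \tag{40}$$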

Theorem 5 I. Fix an arbitrarily small $\varepsilon > 0$, and let $n \to \infty$. Let $x^n$ be generated by an i.i.d.
monotonic distribution $\theta \in \mathcal{M}$. If there exists $m^*$, such that

$$n\sum_{i>m^*}\theta_i\log i = o\left[\mathcal{R}_n(m^*)\right], \tag{41}$$

then there exists a code with length function $L^*(\cdot)$, such that

$$R_n(L^*,\theta) \le \frac{1+\varepsilon}{n}\mathcal{R}_n(m^*) \tag{42}$$

for the monotonic distribution $\theta$.

II. If there exists $m^*$ for which $\rho^* = o\left(n^{1/3}/(\log n)\right)$, such that

$$\sum_{i>m^*}\theta_i\log i = o(1), \tag{43}$$

then there exists a universal code with length function $L^*(\cdot)$, such that

$$R_n(L^*,\theta) = o(1). \tag{44}$$

Theorem 5 implies that if a monotonic distribution decays fast enough, its effective alphabet
size does not exceed $O(n^{\rho})$, and, as long as $\rho$ is fixed, bounds of the same order as those obtained for
finite alphabets are achievable. Specifically, very fast decaying distributions, although over infinite
alphabets, may even behave like monotonic distributions with $o(n^{1/3})$ symbols. The condition in
(41) merely means that the cost that a code would incur in order to code very rare symbols, which
are larger than the effective alphabet size, is negligible w.r.t. the total cost obtained from other,
more likely, symbols. Note that for $m = n$, the bound is tighter than that of the third region
of Theorem 4, and a constant of 5/9 replaces 2/3. The second part of the theorem states that if
the decay is slow, but the cost of coding rare symbols is still diminishing per symbol, a universal
code still exists for such distributions. However, in this case the redundancy will be dominated by
coding the rare (out of order) symbols. This result leads to the following corollary:




Corollary 1 As $n \to \infty$, sequences generated by monotonic distributions with $H_\theta(X) = O(1)$ are
universally compressible in the average sense.

Corollary 1 shows that sequences generated by finite entropy monotonic distributions can be compressed
in the average with diminishing per symbol redundancy. This result is consistent with the
results shown in [10].

While Theorem 5 bounds the redundancy decay rate with two extremes, a more general theorem
can be used to provide some best redundancy decay rate that a code can be designed to adapt to 
for some unknown monotonic distribution that governs the data. As the examples at the end of 
this section show, the next theorem is very useful for slower decaying distributions. 

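Theorem 6 Fix an arbitrarily small $\varepsilon > 0$, and let $n \to \infty$. Let $x^n$ be generated by an i.i.d.
monotonic distribution $\theta \in \mathcal{M}$. Then, there exists a code with length function $L^*(\cdot)$, that achieves
redundancy

$$nR_n(L^*,\theta) \le (1+\varepsilon)\cdot\min_{\alpha,\rho:\,\rho\ge\alpha+\varepsilon}\left\{\frac{1}{2}(\rho+2\alpha)(\rho-\alpha)(\log n)^2 n^{\alpha} + 5(\log e)n^{1-2\alpha} + \left(1+\frac{1}{\rho}\right)n\sum_{i>n^{\rho}}\theta_i\log i\right\} \tag{45}$$

for coding sequences generated by the source $\theta$.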

We continue with proving the two theorems and the corollary. 

Proof: The idea of the proof of both theorems is to separate the more likely symbols from the
unlikely ones. First, the code determines the point of separation $m = n^{\rho}$. (Note that $\rho$ can be
greater than 1.) Then, all symbols $i \le m$ are considered likely and are quantized in a similar
manner as in the codes for smaller alphabets. Unlike bounded alphabets, though, a more robust
grid is used here to allow larger values of $m$. Coding of occurrences of these symbols uses the
quantized probabilities. The unlikely symbols are coded hierarchically. They are first merged into
a single symbol, and then are coded within this symbol, where the full cost of conveying to the
decoder which rare symbols occur in the sequence is required. Thus, they are presented giving
their actual value. As long as the decay is fast enough, the average cost of conveying these symbols
becomes negligible w.r.t. the cost of coding the likely symbols. If the decay is slower, but still fast
enough, as in the case described in condition (43), the coding cost of the rare symbols dominates the
redundancy, but still diminishing redundancy can be achieved. In order to determine the best value
of $m$ for a given sequence, all values are tried and the one yielding the shortest description is used
for coding the specific sequence $x^n$.

Let $m \ge 2$ determine the number of likely symbols in the alphabet. For a given $m$, define

$$S_m \triangleq \sum_{i>m}\theta_i \tag{46}$$

as the total probability of the remaining symbols. Given $\theta$, $m$ and $S_m$, a probability

$$P(x^n|m,S_m,\theta) \triangleq \prod_{i=1}^{m}\theta_i^{n_x(i)}\cdot S_m^{n_x(x>m)}\cdot\prod_{i>m}\left[\frac{n_x(i)}{n_x(x>m)}\right]^{n_x(i)} \tag{47}$$

can be computed for $x^n$, where $n_x(i)$ is the occurrence count of symbol $i$ in $x^n$, and $n_x(x>m)$
is the count of all symbols greater than $m$ in $x^n$. This probability mass function clusters all large
symbols (with small probabilities) greater than $m$ into one symbol. Then, it uses the ML estimate
of each of the large symbols to distinguish among them in the clustered symbol.
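To make the clustered probability in (47) concrete, here is a minimal sketch that evaluates its base-2 logarithm for a short sequence. It is only an illustration: the function name `clustered_log_prob` is hypothetical, and $S_m$ is taken as one minus the sum of the first $m$ entries of `theta`, which assumes `theta` lists the probabilities of symbols $1, 2, \ldots$ in order.

```python
import math
from collections import Counter

def clustered_log_prob(x, theta, m):
    """log2 of P(x^n | m, S_m, theta) as in (47).

    theta[i-1] is the probability of symbol i and must cover symbols 1..m.
    Symbols above m are merged into one cluster of probability S_m and
    then separated using their empirical (ML) frequencies in the cluster.
    """
    counts = Counter(x)
    s_m = 1.0 - sum(theta[:m])
    n_tail = sum(c for sym, c in counts.items() if sym > m)
    logp = 0.0
    for sym, c in counts.items():
        if sym <= m:
            logp += c * math.log2(theta[sym - 1])
        else:
            logp += c * math.log2(c / n_tail)   # ML estimate inside the cluster
    if n_tail > 0:
        logp += n_tail * math.log2(s_m)         # cost of entering the cluster
    return logp

theta = [0.5 ** i for i in range(1, 21)]
print(clustered_log_prob([1, 1, 2, 3, 5, 5, 7], theta, 3))
```

For this sequence the clustered log-probability is larger than the true i.i.d. log-probability, because the ML estimates within the cluster are at least as favorable as the true conditional probabilities.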

For every $m$, we can define a quantization grid $\xi_m$ for the first $m$ probability parameters of
$\theta$. The idea is similar to that used for all probability parameters in the proof of Theorem 4. If
$m = o(n^{1/3})$, we use $\xi_m = \tau_m$, where $\tau_m$ is the grid defined in (18) where $m$ replaces $k$. Otherwise,
we can use the definition of $\eta$ in (35). However, to obtain tighter bounds for large $m$, we define a
different grid for the larger values of $m$ following similar steps to those in (32)-(36). First, define
the $j$th interval as

$$I_j = \left[\frac{n^{(j-1)\beta}}{n^{\rho+2\alpha}}, \frac{n^{j\beta}}{n^{\rho+2\alpha}}\right), \quad 1 \le j \le J_\rho, \tag{48}$$

where $\rho = (\log m)/(\log n)$ as defined above, $\alpha$ is a parameter, and $\beta = 1/(\log n)$ as before. Within
the $j$th interval, we define the spacing in the grid by

$$\Delta_j^{(\rho)} = \frac{n^{j\beta}}{n^{\rho+3\alpha}}. \tag{49}$$

As in (34),

$$|I_j| \le 0.5\cdot n^{\alpha}, \quad \forall j: j = 1, 2, \ldots, J_\rho, \tag{50}$$

and the total number of intervals is

$$J_\rho = \lceil(\rho+2\alpha)\log n\rceil. \tag{51}$$

Similarly to (35), $\xi_m$ is defined as

$$\xi_m = (\xi_1, \xi_2, \ldots) = \left(\frac{1}{n^{\rho+2\alpha}}, \frac{1}{n^{\rho+2\alpha}}+\frac{2}{n^{\rho+3\alpha}}, \ldots, \frac{2}{n^{\rho+2\alpha}}, \frac{2}{n^{\rho+2\alpha}}+\frac{4}{n^{\rho+3\alpha}}, \ldots\right). \tag{52}$$

The cardinality of $\xi_m$ is thus

$$B_\rho \triangleq |\xi_m| \le 0.5\cdot n^{\alpha}\lceil(\rho+2\alpha)\log n\rceil. \tag{53}$$

An $m$th order quantized version $\theta'_m$ of $\theta$ is obtained by quantizing $\theta_i$, $i = 2, 3, \ldots, m$, onto $\xi_m$,
such that $\theta'_i \in \xi_m$ for these values of $i$. Then, the remaining cluster probability $S_m$ is quantized into
$S'_m \in [1/n, 2/n, \ldots, 1]$. The parameter $\theta'_1$ is constrained by the quantization of the other parameters.
Quantization is performed in a similar manner as before, to minimize the accumulating cost and
retain monotonicity.

Now, for any $m \ge 2$, let $\varphi_m$ be any monotonic probability vector of cardinality $m$ whose last
$m-1$ components are quantized into $\xi_m$, and let $\sigma_m \in [1/n, 2/n, \ldots, 1]$ be a quantized estimate of
the total probability of the remaining symbols, such that $\sum_{i=1}^{m}\varphi_{i,m} + \sigma_m = 1$, where $\varphi_{i,m}$ is the $i$th
component of $\varphi_m$. If $m$, $\sigma_m$ and $\varphi_m$ are known, a given $x^n$ can be coded using $P(x^n|m,\sigma_m,\varphi_m)$
as defined in (47), where $\sigma_m$ replaces $S_m$, and the $m$ components of $\varphi_m$ replace the first $m$ components
of $\theta$. However, in the universal setting, none of these parameters are known in advance.
Furthermore, neither the symbols greater than $m$ nor their conditional ML probabilities are known
in advance. Therefore, the total cost of coding $x^n$ using these parameters requires universality costs
for describing them. The cost of universally coding $x^n$ assigning probability $P(x^n|m,\sigma_m,\varphi_m)$ to it
thus requires the following five components: 1) $m$ should be described using Elias' representation
with at most $1 + \rho\log n + 2\log(1+\rho\log n)$ bits. 2) The value of $\sigma_m$ in its quantization grid should
be coded using $\log n$ bits. 3) The $m$ components of $\varphi_m$ require $L_R(\varphi_m)$ (which is bounded below)
bits. 4) The number $c_x(x>m)$ of distinct letters in $x^n$ greater than $m$ is coded using $\log n$ bits.
5) Each letter $i > m$ in $x^n$ is coded. Elias' coding for the integers using $1 + \log i + 2\log(1+\log i)$
bits can be used, but to simplify the derivation we can also use the code, also presented in [7], that
uses no more than $1 + 2\log i$ bits to describe $i$. In addition, at most $\log n$ bits are required for
describing $n_x(i)$ in $x^n$. For $n \to \infty$, $m \gg 1$, and $\varepsilon > 0$ arbitrarily small, this yields a total cost of

$$L(x^n|m,\sigma_m,\varphi_m) \le -\log P(x^n|m,\sigma_m,\varphi_m) + L_R(\varphi_m) + \left[(1+\varepsilon)\rho + c_x(x>m) + 2\right]\log n + c_x(x>m) + 2\sum_{i>m,\,i\in x^n}\log i, \tag{54}$$

where we assume $m$ is large enough to bound the cost of describing $m$ by $(1+\varepsilon)\rho\log n$.
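The two integer code lengths quoted in component 5 can be checked against concrete codewords. The sketch below is illustrative only: it implements the standard Elias delta code and compares its length with the bound $1 + \log i + 2\log(1+\log i)$, and it also gives the length of a gamma-style code, which stays within $1 + 2\log i$. The function names are assumptions of this example, and whether [7] uses exactly these constructions is not confirmed here.

```python
import math

def elias_delta(i):
    """Elias delta codeword (as a bit string) for an integer i >= 1."""
    if i < 1:
        raise ValueError("Elias delta is defined for positive integers")
    binary = bin(i)[2:]                  # N + 1 bits, leading bit is 1
    length_bits = bin(len(binary))[2:]   # binary form of N + 1
    prefix = "0" * (len(length_bits) - 1)
    return prefix + length_bits + binary[1:]

def elias_delta_bound(i):
    """Upper bound 1 + log i + 2 log(1 + log i) quoted in the text."""
    return 1 + math.log2(i) + 2 * math.log2(1 + math.log2(i))

def gamma_style_length(i):
    """Length 2*floor(log2 i) + 1 of a gamma-style code, at most 1 + 2 log i."""
    return 2 * (i.bit_length() - 1) + 1

for i in (1, 2, 10, 1000):
    print(i, elias_delta(i), len(elias_delta(i)), gamma_style_length(i))
```

The delta code is shorter for large $i$, while the gamma-style length is simpler to bound, which matches the reason the text gives for using the $1 + 2\log i$ bound in the derivation.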
The description cost of $\varphi_m$ for $m = o(n^{1/3})$ is bounded by

$$L_R(\varphi_m) \le (1+\varepsilon)\frac{m-1}{2}\log\frac{n}{m^3} \tag{55}$$

using (21), where $m$ replaces $k$. The $(\log n)^2$ factor in (21) can be absorbed in $\varepsilon$ since we limit $m$
to $o(n^{1/3})$, unlike the derivation in (21). For larger values of $m$, we describe symbol probabilities of
$\varphi_m$ in the grid $\xi_m$ in a similar manner to the description of $O(n)$ symbol probabilities in the grid
$\eta$. Similarly to (38), we thus have

$$L_R(\varphi_m) \le B_\rho + B_\rho\log\frac{n^{\rho}+B_\rho}{B_\rho} + 2B_\rho\log\log\frac{n^{\rho}+B_\rho}{B_\rho} + o(B_\rho) \overset{(a)}{\le} (1+\varepsilon)\frac{1}{2}(\rho+2\alpha)(\rho+\varepsilon-\alpha)(\log n)^2 n^{\alpha}, \tag{56}$$

where to obtain inequality (a), we first multiply $n^{\rho}$ by $n^{\varepsilon}$ in the numerator of the argument of the
logarithm. This is only necessary for $\rho \to \alpha$ to guarantee that $n^{\rho+\varepsilon} \gg B_\rho$. Substituting the bound
on $B_\rho$ from (53), absorbing low order terms in the leading $\varepsilon$, yields the bound.

A sequence $x^n$ can now be coded using the universal parameters that minimize the length of
the sequence description, i.e.,

$$L^*(x^n) = \min_{m'\ge2}\ \min_{\sigma_{m'}\in\left[\frac{1}{n},\frac{2}{n},\ldots,1\right]}\ \min_{\varphi_{m'}:\,\varphi_{i,m'}\in\xi_{m'},\,i\ge2} L(x^n|m',\sigma_{m'},\varphi_{m'}) \le L(x^n|m,S'_m,\theta'_m), \tag{57}$$

where $\theta'_m$ and $S'_m$ are the true source parameters quantized as described above, and the inequality
holds for every $m$. Note that the minimization on $m'$ should be performed only up to the maximal
symbol that occurs in $x^n$.

Following the steps that led to (27), up to negligible integer length constraints, the average redundancy using
$L^*(\cdot)$ is bounded, for every $m \ge 2$, by

$$\begin{aligned}
nR_n(L^*,\theta) &= E_\theta\left[L^*(X^n) + \log P_\theta(X^n)\right]\\
&\overset{(a)}{\le} E_\theta\left[L(X^n|m,S'_m,\theta'_m) + \log P_\theta(X^n)\right]\\
&\overset{(b)}{\le} E_\theta\log\frac{P_\theta(X^n)}{P(X^n|m,S'_m,\theta'_m)} + L_R(\theta'_m) + 2\sum_{i>m}P_\theta(i\in X^n)\log i\\
&\quad + (1+\varepsilon)\left[E_\theta C_x(X>m) + \rho + 2\right]\log n,
\end{aligned} \tag{58}$$

where (a) follows from (57), and (b) follows from averaging on (54) with $\sigma_m = S'_m$ and $\varphi_m = \theta'_m$,
where the average on $c_x(x>m)$ is absorbed in the leading $\varepsilon$.

Expressing $P_\theta(x^n)$ as

$$P_\theta(x^n) = \prod_{i\le m}\theta_i^{n_x(i)}\cdot S_m^{n_x(x>m)}\cdot\prod_{i>m}\left(\frac{\theta_i}{S_m}\right)^{n_x(i)}, \tag{59}$$

and defining $\delta_S = S_m - S'_m$, the first term of (58) is bounded, for the upper region of $m$, by

$$\begin{aligned}
E_\theta\log\frac{P_\theta(X^n)}{P(X^n|m,S'_m,\theta'_m)}
&\le E_\theta\left[\sum_{i=1}^{m}N_x(i)\log\frac{\theta_i}{\theta'_{i,m}} + N_x(X>m)\log\frac{S_m}{S'_m} + \sum_{i>m}N_x(i)\log\frac{\theta_i/S_m}{N_x(i)/N_x(X>m)}\right]\\
&\overset{(a)}{\le} n\sum_{i=1}^{m}\theta_i\log\frac{\theta_i}{\theta'_{i,m}} + nS_m\log\frac{S_m}{S'_m}\\
&\overset{(b)}{\le} n(\log e)\left[\sum_{i=1}^{m}\frac{\delta_i^2}{\theta'_{i,m}} + \frac{\delta_S^2}{S'_m}\right]\\
&\overset{(c)}{\le} (\log e)\cdot n^{1-2\alpha} + 2(\log e)n^{1-\rho-4\alpha}\cdot\sum_{j=1}^{J_\rho}k_j n^{j\beta} + \log e\\
&\overset{(d)}{\le} 5(\log e)n^{1-2\alpha} + \log e,
\end{aligned} \tag{60}$$

where (a) holds since for the third term, the conditional ML probability used for coding is greater than
the actual conditional probability assigned to all letters greater than $m$ for every $x^n$. Hence, the
third term is bounded by 0. For the other terms expectation is performed. Inequality (b) is obtained
similarly to (28), where quantization includes the first $m$ components of $\theta$ and the parameter $S_m$.
Then, inequality (c) follows the same reasoning as step (a) of (39). The first term bounds the worst
case in which all $n^{\rho}$ symbols are quantized to $1/n^{\rho+2\alpha}$ with $|\delta_i| \le 1/n^{\rho+2\alpha}$. The second term is
obtained where $\theta'_{i,m} \ge n^{(j-1)\beta}/n^{\rho+2\alpha}$ and $|\delta_i| \le n^{j\beta}/n^{\rho+3\alpha}$ for $\theta_i \in I_j$, and $k_j = |\theta_i \in I_j|$ as before.
The last term holds since $S'_m \ge 1/n$ and $|\delta_S| \le 1/n$. Finally, (d) is obtained similarly to step (b) of
(39), where as in (29), $\sum_j k_j n^{j\beta} \le 2n^{\rho+2\alpha}$. For $m = o(n^{1/3})$, the same initial steps up to step (b) in
(60) are applied, and then the remaining steps in (28) are applied to the left sum with $m$ replacing
$k$, yielding a total quantization cost of $5(\log e)m + \log e$.

To bound the third and fourth terms of (58), we realize that

$$P_\theta(i\in X^n) = 1 - (1-\theta_i)^n \le n\theta_i. \tag{61}$$

Similarly,

$$E_\theta C_x(X>m) = \sum_{i>m}P_\theta(i\in X^n) \le nS_m. \tag{62}$$

Combining the dominant terms of the third and fourth terms of (58), we have

$$\begin{aligned}
2\sum_{i>m}P_\theta(i\in X^n)\log i + (1+\varepsilon)E_\theta C_x(X>m)\log n
&\overset{(a)}{=} \sum_{i>m}P_\theta(i\in X^n)\left[2\log i + (1+\varepsilon)\log n\right]\\
&\overset{(b)}{\le} \left(2+\frac{1+\varepsilon}{\rho}\right)\sum_{i>m}P_\theta(i\in X^n)\log i
\overset{(c)}{\le} \left(2+\frac{1+\varepsilon}{\rho}\right)n\sum_{i>m}\theta_i\log i,
\end{aligned} \tag{63}$$

where (a) is because $E_\theta C_x(X>m) = \sum_{i>m}P_\theta(i\in X^n)$, (b) is because for $i > m = n^{\rho}$, $\log i >
\rho\log n$, and (c) follows from (61). Given $\rho > \varepsilon$ for an arbitrary fixed $\varepsilon > 0$, the resulting coefficient
above is upper bounded by some constant $\kappa$.
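The inequalities (61) and (62) are easy to probe numerically. The following minimal sketch computes the expected number of distinct tail symbols for a given list of tail probabilities and compares it with $nS_m$; the function name and the truncation of the infinite tail to finitely many symbols are assumptions made only for this illustration.

```python
def expected_distinct_tail(theta_tail, n):
    """E[C_x(X > m)]: sum over tail symbols of 1 - (1 - theta_i)^n, as in (62)."""
    return sum(1.0 - (1.0 - t) ** n for t in theta_tail)

# Tail of theta_i = 2^(-i) beyond m = 10, truncated at i = 60, with n = 1000.
tail = [0.5 ** i for i in range(11, 61)]
n = 1000
print(expected_distinct_tail(tail, n), "<=", n * sum(tail))
```

In this example fewer than one distinct tail symbol is expected in a sequence of length 1000, which is the sense in which symbols beyond the effective alphabet are rare.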

Summing up the contributions of the terms of (58) from (28), (55), and (63), absorbing low
order terms in a leading $\varepsilon'$, we obtain that for $m = o(n^{1/3})$,

$$nR_n(L^*,\theta) \le (1+\varepsilon')\frac{m-1}{2}\log\frac{n}{m^3} + \kappa n\sum_{i>m}\theta_i\log i. \tag{64}$$

For the second region, substituting $\alpha = 1/3$, and summing up the contributions of (60), (56), and
(63) to (58), absorbing low order terms in $\varepsilon'$, we obtain

$$nR_n(L^*,\theta) \le \frac{1+\varepsilon'}{2}\left(\rho+\frac{2}{3}\right)\left(\rho+\varepsilon-\frac{1}{3}\right)(\log n)^2 n^{1/3} + \kappa n\sum_{i>m}\theta_i\log i. \tag{65}$$

Since (64)-(65) hold for every $m > n^{\varepsilon}$, there exists $m^*$ for which the minimal bound is obtained.
To bound the redundancy, we choose this $m^*$. Now, if the condition in (41) holds, then the second
term in (64) and (65) is negligible w.r.t. the first term. Absorbing it in a leading $\varepsilon$, normalizing by
$n$, yields the upper bound of (42), and concludes the proof of Part I of Theorem 5.

For Part II of Theorem 5, we consider the bound of the second region in (65). If there exists
$\rho^* = o\left(n^{1/3}/(\log n)\right)$ for which the condition in (43) holds, then both terms of (65) are of $o(n)$,
yielding a total redundancy per symbol of $o(1)$. The proof of Theorem 5 is concluded. □



To prove Corollary 1, we use Wyner's inequality [32], which implies that for a finite entropy
monotonic distribution,

$$\sum_{i\ge1}\theta_i\log i = E_\theta[\log X] \le H_\theta[X]. \tag{66}$$

Since the sum on the left hand side of (66) is finite if $H_\theta[X]$ is finite, there must exist some $n_0$ such
that $\sum_{i>n_0}\theta_i\log i = o(1)$. Let $n > n_0$, then for $m^* = n$ and $\rho^* = 1$, condition (43) is satisfied.
Therefore, (44) holds, and the proof of Corollary 1 is concluded. □
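The inequality in (66) holds term by term for a monotonic distribution, since $i\theta_i \le \sum_{j\le i}\theta_j \le 1$ gives $\log i \le -\log\theta_i$. The sketch below checks it numerically on two truncated monotonic distributions; the helper name and the truncation lengths are assumptions made for this illustration.

```python
import math

def log_moment_and_entropy(theta):
    """Return (E[log2 X], H[X]) for theta[i-1] = P(X = i)."""
    e_log = sum(t * math.log2(i) for i, t in enumerate(theta, start=1) if t > 0)
    entropy = -sum(t * math.log2(t) for t in theta if t > 0)
    return e_log, entropy

geometric = [0.5 ** i for i in range(1, 40)]
power_law = [1.0 / (i * (i + 1)) for i in range(1, 2000)]
for name, theta in (("geometric", geometric), ("power law", power_law)):
    e_log, h = log_moment_and_entropy(theta)
    print(name, e_log, "<=", h)
```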






We now consider only the upper region in (58) with parameters $\alpha$ and $\rho$ taking any valid value.
(The code leading to the bound of the upper region can be applied even if the actual effective
alphabet size is in the lower region.) We can sum up the contributions of (60), (56), and (63) to
(58), absorbing low order terms in $\varepsilon$. Equation (56) is valid without the middle $\varepsilon$ term as long as
$\rho \ge \alpha + \varepsilon$. Since, in the upper region of $m$, $i > m$ is large enough, Elias' code for the integers
can be used costing $(1+\varepsilon)\log i$ to code $i$, with $\varepsilon > 0$ which can be made arbitrarily small. Hence,
the leading coefficient of the bound in (63) can be replaced by $(1+\varepsilon)(1+1/\rho)$. This yields the
expression bounding the redundancy in (45). This expression applies to every valid choice of $\alpha$ and
$\rho$, including the choice that minimizes the expression. Thus the proof of Theorem 6 is concluded. □

5.2 Examples 

We demonstrate the use of the bounds of Theorems 5 and 6 with three typical distributions over
the integers. We specifically show that the redundancy rate of $O(n^{1/3+\varepsilon})$ bits overall is achievable
when coding many of the typical monotonic distributions, and, in fact, for many distributions
faster convergence rates are achievable with the codes provided in proving the theorems above.
The assumption that very few unlikely symbols are likely to appear in a sequence generated by a
monotonic distribution, which is reflected in the conditions in (41) and (43), is very realistic even
in practical examples. Specifically, in the phone book example, there may be many rare names, but
only very few of them may occur in a certain city, and the more common names constitute most of
any possible phone book sequence.

5.2.1 Fast Decaying Distributions Over the Integers 

Consider the monotonic distributions over the integers of the form,

$$\theta_i = \frac{a}{i^{1+\gamma}}, \quad i = 1, 2, \ldots, \tag{67}$$

where $\gamma > 0$, and $a$ is a normalization coefficient that guarantees that the probabilities over all
integers sum to 1. It is easy to show by approximating summation by integration that for some
$m \to \infty$,

$$S_m = \sum_{i>m}\theta_i \le \frac{a}{\gamma m^{\gamma}}, \tag{68}$$

$$\sum_{i>m}\theta_i\log i \le (1+o(1))\frac{a\log m}{\gamma m^{\gamma}}. \tag{69}$$

For $m = n^{\rho}$ and fixed $\rho$, the sum in (41) is thus $O(n^{1-\rho\gamma}\log n)$, which is $o(n^{1/3}(\log n)^2)$ for every
$\rho \ge 2/(3\gamma)$. Specifically, as long as $\gamma \le 2$ (slow decay), the minimal value of $\rho$ required to guarantee
negligibility of the sum in (41) is at least 1/3. Using Theorem 5, this implies that for $\gamma \le 2$,
the second (upper) region of the upper bound in (42) holds with the minimal choice of $\rho^* = 2/(3\gamma)$.
Plugging in this value in the second region of (40) (i.e., in (42)) yields the upper bound shown below
for this region. For $\gamma > 2$, $2/(3\gamma) < 1/3$. Hence, (41) holds for $m^* = o(n^{1/3})$. This means that for
the distribution in (67) with $\gamma > 2$, the effective alphabet size is $o(n^{1/3})$, and thus the achievable
redundancy is in the first region of the bound of (42). Thus, even though the distribution is over
an infinite alphabet, its compressibility behavior is similar to a distribution over a relatively small
alphabet. To find the exact redundancy rate, we balance between the contributions of (55) and
(63) in (58). As long as $1 - \rho\gamma < \rho$, condition (41) holds, and the contribution of small letters
in (63) is negligible w.r.t. the other terms of the redundancy. Equality, implying $\rho^* = 1/(1+\gamma)$,
achieves the minimal redundancy rate. Thus, for $\gamma > 2$,

$$\begin{aligned}
nR_n(L^*,\theta) &\overset{(a)}{\le} (1+\varepsilon)\left[\frac{a(2\rho^*+1)}{\gamma}n^{1-\gamma\rho^*}\log n + \frac{n^{\rho^*}}{2}(1-3\rho^*)\log n\right]\\
&\overset{(b)}{=} (1+\varepsilon)\left[\frac{a(3+\gamma)}{\gamma(1+\gamma)} + \frac{\gamma-2}{2(1+\gamma)}\right]n^{\frac{1}{1+\gamma}}\log n,
\end{aligned} \tag{70}$$

where the first term in (a) follows from the bounds in (63) and (69), with $m = n^{\rho}$, and the second
term from that in (55), and (b) follows from $\rho^* = 1/(1+\gamma)$. Note that for a fixed $\rho^*$, the factor 3
in the first term can be reduced to 2 with Elias' coding for the integers. The results described are
summarized in the following corollary:



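Corollary 2 Let $\theta \in \mathcal{M}$ be defined in (67). Then, there exists a universal code with length function
$L^*(\cdot)$ that has only prior knowledge that $\theta \in \mathcal{M}$, that can achieve universal coding redundancy

$$R_n(L^*,\theta) \le \begin{cases}
(1+\varepsilon)\frac{1}{2}\left(\frac{2}{3\gamma}+\frac{2}{3}\right)\left(\frac{2}{3\gamma}+\varepsilon-\frac{1}{3}\right)\frac{(\log n)^2 n^{1/3}}{n}, & \text{for } \gamma \le 2,\\
(1+\varepsilon)\left[\frac{a(3+\gamma)}{\gamma(1+\gamma)} + \frac{\gamma-2}{2(1+\gamma)}\right]\frac{n^{\frac{1}{1+\gamma}}\log n}{n}, & \text{for } \gamma > 2.
\end{cases} \tag{71}$$

Corollary 2 gives the redundancy rates for all distributions defined in (67). For example, if $\gamma = 1$,
the redundancy is $O(n^{1/3}(\log n)^2)$ bits overall with coefficient 2/9. For $\gamma = 3$, $O(n^{1/4}\log n)$ bits
are required. For faster decays (greater $\gamma$) even smaller redundancy rates are achievable.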
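The two examples above follow from the choices $\rho^* = 2/(3\gamma)$ for $\gamma \le 2$ and $\rho^* = 1/(1+\gamma)$ for $\gamma > 2$. A minimal sketch, with $\varepsilon$ set to 0 and a hypothetical helper name, reproduces the exponent of $n$, the power of $\log n$, and (in the first case) the leading coefficient $\frac{1}{2}(\rho^*+\frac{2}{3})(\rho^*-\frac{1}{3})$ taken from the second region of (40):

```python
from fractions import Fraction

def power_law_rate(gamma):
    """Return (rho_star, exponent of n, power of log n, coefficient or None).

    Describes the overall redundancy in bits (before dividing by n), with
    epsilon set to 0. For gamma > 2 the coefficient depends on the
    normalization constant a, so None is returned in its place.
    """
    gamma = Fraction(gamma)
    if gamma <= 2:
        rho = Fraction(2, 3) / gamma
        coeff = Fraction(1, 2) * (rho + Fraction(2, 3)) * (rho - Fraction(1, 3))
        return rho, Fraction(1, 3), 2, coeff
    rho = 1 / (1 + gamma)
    return rho, rho, 1, None

print(power_law_rate(1))   # exponent 1/3, (log n)^2, coefficient 2/9
print(power_law_rate(3))   # exponent 1/4, log n
```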

5.2.2 Geometric Distributions 

Geometric distributions given by

$$\theta_i = p(1-p)^{i-1}; \quad i = 1, 2, \ldots, \tag{72}$$

where $0 < p < 1$, decay even faster than the distribution over the integers in (67). Thus their
effective alphabet sizes are even smaller. This implies that a universal code can have even smaller
redundancy than that presented in Corollary 2 when coding sequences generated by a geometric
distribution (even if this is unknown in advance, and the only prior knowledge is that $\theta \in \mathcal{M}$).
Choosing $m = \ell\cdot\log n$, the contribution of low probability symbols in (63) to (58) can be upper
bounded by

$$2n\sum_{i>m}\theta_i(\log i + \log n) \overset{(a)}{\le} 2n(1-p)^m\log n + O\left(n(1-p)^m\log m\right) \overset{(b)}{=} 2n^{1+\ell\log(1-p)}(\log n) + O\left(n^{1+\ell\log(1-p)}\log\log n\right), \tag{73}$$

where (a) follows from computing $S_m$ using geometric series, and bounding the second term, and
(b) follows from substituting $m = \ell\log n$ and representing $(1-p)^{\ell\log n}$ as $n^{\ell\log(1-p)}$. As long as
$\ell \ge 1/(-\log(1-p))$, the expression in (73) is $O(\log n)$, thus negligible w.r.t. the redundancy upper
bound of (42) with $m^* = \ell^*\log n = (\log n)/(-\log(1-p))$. Substituting this $m^*$ in (42), we obtain
the following corollary:



Corollary 3 Let $\theta \in \mathcal{M}$ be a geometric distribution defined in (72). Then, there exists a universal
code with length function $L^*(\cdot)$ that has only prior knowledge that $\theta \in \mathcal{M}$, that can achieve universal
coding redundancy

$$R_n(L^*,\theta) \le \frac{1+\varepsilon}{-2\log(1-p)}\cdot\frac{(\log n)^2}{n}. \tag{74}$$

Corollary 3 shows that if $\theta$ parameterizes a geometric distribution, sequences governed by $\theta$ can be
coded with average universal coding redundancy of $O((\log n)^2)$ bits. Their effective alphabet size
is $O(\log n)$, implying that larger symbols are very unlikely to occur. For example, for $p = 0.5$, the
effective alphabet size is $\log n$, and $0.5(\log n)^2$ bits are required for a universal code. For $p = 0.75$,
the effective alphabet size is $(\log n)/2$, and $(\log n)^2/4$ bits are required by a universal code.
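The two numerical examples can be reproduced directly from $m^* = (\log n)/(-\log(1-p))$ and the leading term of (74). The sketch below uses base-2 logarithms and drops the $(1+\varepsilon)$ factor; both function names are hypothetical.

```python
import math

def geometric_effective_alphabet(n, p):
    """m* = (log n) / (-log(1 - p)), with logarithms taken base 2."""
    return math.log2(n) / (-math.log2(1.0 - p))

def geometric_redundancy_bits(n, p):
    """Leading term of n times (74): (log n)^2 / (-2 log(1 - p))."""
    return math.log2(n) ** 2 / (-2.0 * math.log2(1.0 - p))

n = 2 ** 20
for p in (0.5, 0.75):
    print(p, geometric_effective_alphabet(n, p), geometric_redundancy_bits(n, p))
```

For $n = 2^{20}$ this gives an effective alphabet of 20 symbols and 200 bits for $p = 0.5$, and 10 symbols and 100 bits for $p = 0.75$, in line with $\log n$, $0.5(\log n)^2$, $(\log n)/2$ and $(\log n)^2/4$.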

5.2.3 Slow Decaying Distributions Over the Integers 

Up to now, we considered fast decaying distributions, which all achieved the $O(n^{1/3+\varepsilon}/n)$ redundancy
rate. We now consider a slowly decaying monotonic distribution over the integers, given
by

$$\theta_i = \frac{a}{i(\log i)^{2+\gamma}}, \quad i = 2, 3, \ldots, \tag{75}$$

where $\gamma > 0$ and $a$ is a normalizing factor (see, e.g., [12], [27]). This distribution has finite
entropy only if $\gamma > 0$ (but is a valid infinite entropy distribution for $\gamma > -1$). Unlike the previous
distributions, we need to use Theorem 6 to bound the redundancy for coding sequences generated
by this distribution. Approximating the sum with an integral, the order of the third term of (45)
is

$$n\sum_{i>m}\theta_i\log i = O\left(\frac{n}{(\log m)^{\gamma}}\right). \tag{76}$$

In order to minimize the redundancy bound of (45), we define $\rho = n^{\ell}$. For the minimum rate, all
terms of (45) must be balanced. To achieve that, we must have

$$\alpha + 2\ell = 1 - 2\alpha = 1 - \gamma\ell. \tag{77}$$

The solution is $\alpha = \gamma/(4+3\gamma)$, and $\ell = 2/(4+3\gamma)$. Substituting these values in the expression of
(45), with $\rho = n^{\ell}$, results in the first term in (45) dominating, and yields the following corollary:



Corollary 4 Lei G M. be defined in ( |75| ) mtt 7 > 0. Then, there exists a universal code with 
length function L*(-) i/iai /ias on/y prior knowledge that G t/iai can achieve universal coding 
redundancy 

7+4 _ 

77,37+4 (\qp fi) 2 

R n (L ,0) < (1 + e) ^ tL - ( 78 ) 



Due to the slow decay rate of the distribution in (75), the effective alphabet size is much greater here. For $\gamma = 1$, for example, it is $n^{n^{2/7}}$. This implies that very large symbols are likely to appear in $x^n$. As $\gamma$ increases though, the effective alphabet size decreases, and as $\gamma \to \infty$, $m \to n$. The redundancy rate increases due to the slow decay. For $\gamma \ge 1$, it is $O\left(n^{5/7}(\log n)^2/n\right)$. As $\gamma \to \infty$, since the distribution tends to decay faster, the redundancy rate tends to the finite alphabet rate of $O\left(n^{1/3}(\log n)^2/n\right)$. However, as the decay slows ($\gamma \to 0$), a non-diminishing redundancy rate is approached. Note that the proof of Theorem 6 does not limit the distribution to a finite entropy one. Therefore, the bound of (78) applies, in fact, also to $-1 < \gamma \le 0$. However, for $\gamma \le 0$, the per-symbol redundancy is no longer diminishing.
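
The balancing step behind this corollary can be checked numerically. The following is a minimal sketch, not part of the original derivation: the function name `balance_exponents` is hypothetical, and the code only solves the linear balance condition $\alpha + 2\ell = 1 - 2\alpha = 1 - \gamma\ell$ in exact rational arithmetic and returns the common exponent of $n$.

```python
from fractions import Fraction


def balance_exponents(gamma):
    """Solve alpha + 2*ell = 1 - 2*alpha = 1 - gamma*ell for (alpha, ell).

    Returns (alpha, ell, exponent), where exponent is the common value of
    the three balanced expressions, i.e. the power of n in the bound.
    Hypothetical helper for illustration only.
    """
    gamma = Fraction(gamma)
    # From 1 - 2*alpha = 1 - gamma*ell:   alpha = gamma*ell / 2.
    # From alpha + 2*ell = 1 - 2*alpha:   3*alpha + 2*ell = 1.
    # Substituting the first into the second gives ell*(3*gamma/2 + 2) = 1.
    ell = 1 / (Fraction(3, 2) * gamma + 2)
    alpha = gamma * ell / 2
    exponent = alpha + 2 * ell
    # Sanity check: all three balanced expressions coincide.
    assert exponent == 1 - 2 * alpha == 1 - gamma * ell
    return alpha, ell, exponent


if __name__ == "__main__":
    for g in (1, 2, 10, 1000):
        a, l, e = balance_exponents(g)
        print(f"gamma={g}: alpha={a}, ell={l}, exponent of n={e}")
```

For $\gamma = 1$ this gives $\alpha = 1/7$, $\ell = 2/7$ and exponent $5/7$, and the exponent approaches $1/3$ as $\gamma$ grows, in line with the rates quoted in the text.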



6 Individual Sequences 

In this section, we first show that individual sequences whose empirical distributions obey the monotonicity constraints can be universally compressed as well as the average case. We then study compression of sequences whose empirical distributions may diverge from monotonic. We demonstrate that under mild conditions, similar in nature to those of Theorems 5 and 6, redundancy that diminishes (slower than in the average case) w.r.t. the monotonic ML description length can be obtained. However, these results are only useful when the monotonic ML description length diverges only slightly from the (standard) ML description length of a sequence, i.e., the empirical distribution of a sequence only mildly violates monotonicity. Otherwise, the penalty of using an incorrect monotone model overwhelms the redundancy gain. We begin with sequences that obey the monotonicity constraints.

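Theorem 7 Fix an arbitrarily small $\varepsilon > 0$, and let $n \to \infty$. Let $x^n$ be a sequence for which $\hat{\theta} \in \mathcal{M}$, i.e., $\hat{\theta}_1 \ge \hat{\theta}_2 \ge \cdots$. Let $k = \hat{k}$ be the number of letters occurring in $x^n$. Then, there exists a code $L^*(\cdot)$ that achieves individual sequence redundancy w.r.t. $\hat{\theta}_{\mathcal{M}} = \hat{\theta}$ for $x^n$ which is upper bounded by

$$\hat{R}_n(L^*, x^n) \le \begin{cases} (1+\varepsilon)\,\dfrac{k-1}{2n}\log\dfrac{n(\log n)^2}{k^3}, & \text{for } k \le n^{1/3}, \\[2mm] (1+\varepsilon)(\log n)\left(\log\dfrac{k}{n^{1/3-\varepsilon}}\right)\dfrac{n^{1/3}}{n}, & \text{for } n^{1/3} < k = o(n), \\[2mm] (1+\varepsilon)\,\dfrac{1}{3}(\log n)^2\,\dfrac{n^{1/3}}{n}, & \text{for } n^{1/3} < k = O(n). \end{cases} \tag{79}$$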



Note that by the monotonicity constraint, the number of symbols $\hat{k}$ occurring in $x^n$ also equals the maximal symbol in $x^n$. Since, in the individual sequence case, this maximal symbol defines the class considered, and also to be consistent with Theorem 3, we use $k$ to characterize the alphabet size of a given sequence. (The maximal symbol in the individual sequence case is equivalent to the alphabet size in the average case.) Finally, since $\hat{\theta}$ is monotonic, $\hat{\theta}_{\mathcal{M}} = \hat{\theta}$.

Proof of Theorem 7: The result in Theorem 7 follows directly from the proof of Theorem 3. Both regions of the proof apply here, where instead of quantizing $\theta$ to $\theta'$, we quantize $\hat{\theta}$ to $\hat{\theta}'$ in a similar manner, and do not need to average over all sequences. In fact, instead of using any general $\varphi$ to code $x^n$, we can use $\hat{\theta}'$ without any additional optimizations, where $\log n$ bits describe $k$. The description costs of $\hat{\theta}'$ are almost the same as those of $\theta'$. The factor 2 reduction in the last region is because it is sufficient here to replace $n^2$ by $n$ in the denominators of (32). This is because for every occurring symbol $\hat{\theta}_i \ge 1/n$ and $\delta_i < 1/n$; thus the first term of step (a) in (39) holds with the new grid, and $B_\beta$ in (36) reduces by a factor of 2. The quantization costs bounded in (28) and (39) are thus bounded similarly, where $\hat{\theta}$ replaces $\theta$ and $\hat{\theta}'$ replaces $\theta'$. This results in the bounds in (79) and concludes the proof of Theorem 7. □



If one a-priori knows that $x^n$ is likely to have been generated by a monotonic distribution, the case considered in Theorem 7 is with high probability the typical one. However, a typical sequence can also be one for which $\hat{\theta} \notin \mathcal{M}$, where $\hat{\theta}$ mildly violates the monotonicity. In the pure individual sequence setting (where no underlying distribution is assumed but some monotonicity assumption is reasonable for the empirical distribution of $x^n$), one can still observe sequences that have empirical distributions that are either monotonic or slightly diverge from monotonic. Coding for this more general case can apply the methods described in Section 5 to the individual sequence case. If the divergence from monotonicity is small, one may still achieve bounds of the same order as those presented in Theorem 7, with additional negligible cost of relaying which symbols are out of order. The next theorem, however, provides a general upper bound in the form of the bounds of Theorems 5 and 6 for the individual sequence redundancy w.r.t. the monotonic ML description length, as defined in (10). We begin, again, with some notation.

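Recall the definition of an effective alphabet size $m = m_\rho = n^\rho$ (where $\rho = (\log m)/(\log n)$). Now, use this definition for a specific individual sequence $x^n$. Let

$$K_n(m) = \begin{cases} \dfrac{m-1}{2}\log\dfrac{n}{m}, & m \le n^{1/3}, \\[2mm] m\log\dfrac{n}{m^2}, & n^{1/3} < m = o(\sqrt{n}), \\[2mm] \min_{\alpha<\rho}\left\{\dfrac{\rho+1+\alpha}{2}(\rho-\alpha)(\log n)^2 n^{\alpha} + 3(\log e)\,n^{1-\alpha}\right\}, & \text{otherwise}. \end{cases} \tag{80}$$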



Theorem 8 Fix an arbitrarily small $\varepsilon > 0$, and let $n \to \infty$. Then, there exists a code with length function $L^*(\cdot)$ that achieves individual sequence redundancy w.r.t. the monotonic ML description length of $x^n$ (as defined in (10)) bounded by

$$\hat{R}_n(L^*, x^n) \le \frac{1+\varepsilon}{n}\,\min_{\rho}\left\{K_n(n^{\rho}) + \left(1 + \frac{1}{\rho}\right)\sum_{i > n^{\rho},\, i \in x^n}\log i\right\} \tag{81}$$

for every $x^n$.



Theorem 8 shows that if one can find a relatively small effective alphabet of the symbols that occur in $x^n$, and the symbols outside this alphabet are small enough, $x^n$ can be described with diminishing per-symbol redundancy w.r.t. its monotonic ML description length. This implies that as long as the occurring symbols are not too large, there exists a universal code w.r.t. a monotonic ML distribution for any such sequence $x^n$. This is unlike standard individual sequence compression w.r.t. the i.i.d. ML description length. Specifically, if the effective alphabet size is $O(n)$, and only a small number of symbols which are only polynomial in $n$ occur, the universality cost is $O(\sqrt{n}(\log n)^2)$ bits overall, which gives diminishing per-symbol redundancy of $O((\log n)^2/\sqrt{n})$. This redundancy is much better than what can be achieved in standard compression. The penalty, of course, is incurred when the empirical distribution of an individual sequence diverges significantly from a monotonic one. While the monotonic redundancy can be made diminishing under mild conditions, there is a non-diminishing divergence cost of using the monotonic ML description length instead of the ML description length in that case. This implies that one should compress a sequence as generated by a monotonic distribution only if the total description length required to code $x^n$ as such is shorter than the total description length required to code $x^n$ with standard methods. As shown in the proof of Theorem 8, one prefix bit can inform the decoder which type of description is used.

Theorem 8 shows that as long as the effective alphabet size is polynomial in $n$, $\alpha = 0.5$ optimizes the third region of the upper bound, thus yielding the rate shown above, unless very large symbols occur in $x^n$. For small effective alphabets (the first region), there is no redundancy gain in using the monotonic ML description length over the ML description length. The reason, again, is that the bound is obtained for cases where the actual empirical distribution of a sequence may not be monotonic. One can still use an i.i.d. ML estimate w.r.t. only the effective alphabet, if the additional cost of symbols outside this alphabet is negligible, to better code such sequences. Theorem 8 also shows that if a very large symbol, such as $i = a^n$, $a > 1$, occurs in $x^n$, $x^n$ cannot be universally compressed even w.r.t. its monotonic ML description length. This is because it is impossible to avoid the cost of $(1 + \varepsilon)\log i = (1 + \varepsilon)n\log a$ bits to describe this symbol to the decoder. The bound above and its proof below give a very powerful method to individually compress sequences that have an almost monotonic empirical distribution but may have some limited disorder, for which the monotonic ML description length diverges only negligibly from the ML description length.
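
The role of $\alpha = 0.5$ can be illustrated with a short numerical sketch. It is a simplification under stated assumptions: it keeps only the two $n$-dependent terms of the third region, $(\log n)^2 n^{\alpha}$ and $3(\log e)n^{1-\alpha}$, drops constant factors that depend on $\rho$ and $\alpha$, and ignores the cost of symbols outside the effective alphabet. The function names are hypothetical.

```python
import math


def third_region_bits(n, alpha):
    """Simplified third-region cost: (log n)^2 n^alpha + 3 (log e) n^(1-alpha).

    Logarithms are base 2. Constant factors depending on rho and alpha
    are deliberately omitted (an assumption made for illustration).
    """
    log_n = math.log2(n)
    return log_n ** 2 * n ** alpha + 3 * math.log2(math.e) * n ** (1 - alpha)


def growth_exponent(alpha):
    """Polynomial order in n of the simplified cost: max(alpha, 1 - alpha)."""
    return max(alpha, 1 - alpha)


def per_symbol_bits(n, alpha=0.5):
    """Simplified cost normalized by the sequence length n."""
    return third_region_bits(n, alpha) / n


if __name__ == "__main__":
    # The polynomial order is smallest when the two exponents are balanced.
    best = min(range(1, 100), key=lambda a: growth_exponent(a / 100))
    print("alpha minimizing the polynomial order:", best / 100)
    # With alpha = 0.5 the normalized cost shrinks as n grows.
    for exp in (20, 30, 40):
        print(f"n = 2^{exp}: per-symbol cost ~ {per_symbol_bits(2 ** exp):.6f} bits")
```

Balancing the exponents at $\alpha = 0.5$ gives the $\sqrt{n}$ order; for finite $n$ the exact minimizer of the two-term expression is shifted slightly by the $(\log n)^2$ factor, which does not change the order.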

Proof of Theorem 8: The proof follows the same steps as the proof of Theorems 5 and 6. Each value of $m$ is tested and the best one is chosen, where the same coding costs described in the mentioned proof are computed for each $m$. In addition, one can test the cost of coding $x^n$ using the description lengths for both $\hat{\theta}$ and $\hat{\theta}_{\mathcal{M}}$. Then, one bit can be used to relay which ML estimator is used. If $\hat{\theta}$ is used, the codes for coding individual sequences over large alphabets in either [21] or [25] can be used. In the first region in (81), the bound in [25] is obtained since $\log P_{\hat{\theta}}(x^n) \ge \log P_{\hat{\theta}_{\mathcal{M}}}(x^n)$ for every $x^n$. This bound yields smaller redundancy for this region than that obtained using $\hat{\theta}_{\mathcal{M}}$ if $\hat{\theta}_{\mathcal{M}} \ne \hat{\theta}$. It implies that for small alphabets, if $x^n$ does not have an empirical monotonic distribution, it is better coded, even in terms of universal coding redundancy, using standard universal compression methods without taking advantage of a monotonicity assumption.

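For the other two regions, we start with a lemma.

Lemma 6.1 Let $\hat{\theta}_{\mathcal{M}} = (\hat{\theta}_{1,\mathcal{M}}, \hat{\theta}_{2,\mathcal{M}}, \ldots, \hat{\theta}_{k,\mathcal{M}})$ be the monotonic ML estimator of $\theta$ from $x^n$, i.e., $\hat{\theta}_{1,\mathcal{M}} \ge \hat{\theta}_{2,\mathcal{M}} \ge \cdots \ge \hat{\theta}_{k,\mathcal{M}}$, where $k = \max\{x_1, x_2, \ldots, x_n\}$. Then,

$$\hat{\theta}_{k,\mathcal{M}} \ge \frac{1}{kn}. \tag{82}$$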



Lemma 6.1 provides a lower bound on the minimal nonzero probability component of the monotonic ML estimator. This bound helps in designing the grid of points used to quantize the monotonic ML distribution of $x^n$, while maintaining bounded quantization costs. The proof of Lemma 6.1 is in Appendix C.
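
The lemma can be checked numerically on small examples. The sketch below is not the paper's construction: it assumes the standard result from order-restricted inference that the ML estimate of a multinomial under a non-increasing constraint is the antitonic regression of the empirical frequencies, computed here with the pool-adjacent-violators algorithm. The function name `monotone_ml` is hypothetical.

```python
from fractions import Fraction


def monotone_ml(counts):
    """Non-increasing ML estimate from symbol counts n_x(1), ..., n_x(k).

    Assumption (swapped-in technique, not taken from the source): the
    constrained ML estimate equals the antitonic regression of the
    empirical frequencies, obtained by pooling adjacent violators.
    """
    n = sum(counts)
    blocks = []  # each block is [total count, number of symbols pooled]
    for c in counts:
        blocks.append([c, 1])
        # Merge while the previous block mean is smaller than the last one,
        # i.e. while the non-increasing order is violated.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] < blocks[-1][0] * blocks[-2][1]):
            total, length = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += length
    estimate = []
    for total, length in blocks:
        estimate.extend([Fraction(total, length * n)] * length)
    return estimate


if __name__ == "__main__":
    counts = [3, 1, 0, 2]          # symbol 4 is the maximal symbol, so k = 4
    est = monotone_ml(counts)
    k, n = len(counts), sum(counts)
    print(est)
    print("smallest component >= 1/(kn):", est[-1] >= Fraction(1, k * n))
```

In this sketch the last pooled block always contains symbol $k$, which occurs at least once, and pools at most $k$ symbols, so its average is at least $1/(kn)$, matching the bound of the lemma.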

For $m$ in the second region, we cannot use the grid in (18). The reason is that, here, the quantization cost is affected by both $\hat{\theta}$ and $\hat{\theta}_{\mathcal{M}}$. This is unlike the average case, where the average respective vectors merge. To limit the quantization cost for very small probabilities, using Lemma 6.1, the minimal grid point must be $1/n^2$ or smaller. To make the quantization cost negligible w.r.t. the cost of describing the quantized ML, the ratio $\Delta_j/\varphi_{i,\mathcal{M}}$ between the spacing in interval $j$ and a quantized version $\varphi_{i,\mathcal{M}}$ of $\hat{\theta}_{i,\mathcal{M}}$ in the $j$th interval must be $O(m/n)$. Hence, using the same methodology of the proof of Theorems 5 and 6, we define the $j$th interval for an effective alphabet $m = n^{\rho} = o(\sqrt{n})$ as

$$I_j = \left[\frac{n^{(j-1)\beta}}{n^2}, \frac{n^{j\beta}}{n^2}\right), \quad 1 \le j \le J_\beta. \tag{83}$$

The spacing in the $j$th interval is

$$\Delta_j^{(\rho)} = \frac{m\,n^{j\beta}}{n^3}. \tag{84}$$

This gives a total of

$$B_\rho \le \frac{n}{m}\log n \tag{85}$$

quantization points. Using the same methodology as in (21), this yields a representation cost of

$$L_R(\varphi_m) \le (1 + \varepsilon)\,m\log\frac{n}{m^2}, \tag{86}$$

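where $\varphi_m$ is the quantized version of $\hat{\theta}_{\mathcal{M}}$ in which only the first $m$ components of $\hat{\theta}_{\mathcal{M}}$ are considered. Using the quantization with the grid defined in (83)-(86) in a code similar to the one used in the proof of Theorems 5 and 6, the individual quantization cost is given by

$$\log\frac{P_{\hat{\theta}_{\mathcal{M}}}(x^n)}{P(x^n \mid m, S'_m, \varphi_m)} \overset{(a)}{\le} n\sum_{i=1}^{m}\hat{\theta}_i\log\frac{\hat{\theta}_{i,\mathcal{M}}}{\varphi_{i,m}} + \log e \overset{(b)}{\le} n(\log e)\sum_{i=1}^{m}\hat{\theta}_i\frac{|\delta_i|}{\varphi_{i,m}} + \log e \overset{(c)}{\le} (\log e)\cdot\frac{1}{n^2}\cdot mn\cdot n + (\log e)\cdot n\cdot\frac{m\,n^{j\beta}/n^3}{n^{(j-1)\beta}/n^2} + \log e = 3m(\log e) + \log e, \tag{87}$$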
where (a) follows the same steps as in (60), (b) follows from $\ln(1 + x) \le x$, and then $x \le |x|$, where $\delta_i \triangleq \hat{\theta}_{i,\mathcal{M}} - \varphi_{i,m}$, and (c) follows from Lemma 6.1 and the definition of $I_j$ in (83) (for the worst case first term, $|\delta_i| \le 1/n^2$ and $\varphi_{i,m} \ge 1/(mn)$), from (84) and (83) (the second term), and since $\sum_i \hat{\theta}_i = 1$. The only additional non-negligible cost of coding sequences using a code as defined in the proof of Theorems 5 and 6 for a given $m$ is the cost of coding all symbols $i > m$ that occur in $x^n$. Using a similar derivation to (54), with Elias' asymptotic code for the integers, this yields an additional cost of $(1 + \varepsilon)(1 + 1/\rho)\sum_{i > n^{\rho},\, i \in x^n}\log i$ code bits. Combining all costs, absorbing low order terms in $\varepsilon$, and normalizing by $n$, yields the second region of the bound in (81). Note that this bound also applies to the first region, but in that region, a tighter bound is obtained by using a code that uses the standard i.i.d. ML estimator $\hat{\theta}$. This is because very fine quantization is needed to offset the cost of mismatch between $\hat{\theta}$ and $\hat{\theta}_{\mathcal{M}}$. This quantization requires higher description costs than the description of a quantized type of a sequence when using standard compression. (This is not the case when $\hat{\theta}$ obeys the monotonicity, as in Theorem 7. Even if $\hat{\theta}$ does not obey monotonicity in the upper regions of the bound, this is not the case.)
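
To make the cost of describing large symbols concrete, the sketch below computes the codeword length of one universal integer code. It is an illustrative assumption: the text only refers to Elias' asymptotic code for the integers, and the Elias delta code is used here as a representative whose length is $\log i$ plus lower order terms. The function name is hypothetical.

```python
def elias_delta_length(i):
    """Length in bits of the Elias delta codeword of a positive integer i.

    With N = floor(log2 i) and L = floor(log2(N + 1)), the codeword has
    N + 2L + 1 bits. Elias delta is chosen for illustration; the source
    does not say which of Elias' codes is meant.
    """
    if i < 1:
        raise ValueError("Elias delta is defined for positive integers only")
    n_bits = i.bit_length()            # equals N + 1
    big_n = n_bits - 1                 # N = floor(log2 i)
    big_l = n_bits.bit_length() - 1    # L = floor(log2(N + 1))
    return big_n + 2 * big_l + 1


if __name__ == "__main__":
    for i in (1, 2, 16, 2 ** 20, 2 ** 1000):
        length = elias_delta_length(i)
        log_i = i.bit_length() - 1
        ratio = length / log_i if log_i else float("nan")
        print(f"log2(i) = {log_i}: {length} bits, ratio to log2(i) = {ratio:.3f}")
```

The ratio of the codeword length to $\log i$ approaches 1 as $i$ grows, which is the $(1 + \varepsilon)\log i$ behavior used in the bound; it also shows why a symbol as large as $a^n$ costs on the order of $n$ bits.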

For the last region of the bound, we follow the same steps as were done for the upper region of the bound in Theorem 5, with a parameter $\alpha$. The intervals are chosen, again, to guarantee bounded quantization costs. Hence,

$$I_j = \left[\frac{n^{(j-1)\beta}}{n^{\rho+1+\alpha}}, \frac{n^{j\beta}}{n^{\rho+1+\alpha}}\right), \quad 1 \le j \le J_\beta. \tag{88}$$

The spacing in the $j$th interval is

$$\Delta_j^{(\rho)} = \frac{n^{j\beta}}{n^{\rho+1+2\alpha}}. \tag{89}$$

This gives a total of

$$B_\rho \le 0.5\,n^{\alpha}\left\lceil(\rho + 1 + \alpha)\log n\right\rceil \tag{90}$$

quantization points. Using the same methodology as in (56), this yields a representation cost of

$$L_R(\varphi_m) \le (1 + \varepsilon)\,\frac{\rho + 1 + \alpha}{2}\,(\rho + \varepsilon - \alpha)(\log n)^2 n^{\alpha}. \tag{91}$$

Similarly to (87),

$$\log\frac{P_{\hat{\theta}_{\mathcal{M}}}(x^n)}{P(x^n \mid m, S'_m, \varphi_m)} \overset{(a)}{\le} (\log e)\frac{n^{\rho+2}}{n^{\rho+1+\alpha}} + (\log e)\,2n^{1-\alpha} + \log e = 3(\log e)\,n^{1-\alpha} + \log e, \tag{92}$$

where (a) follows from similar steps to (a)-(c) of (87). Using Lemma 6.1, $\varphi_{i,m} \ge 1/n^{\rho+1}$ and $|\delta_i| \le 1/n^{\rho+1+\alpha}$, leading to the first term. Bounding $|\delta_i| \le n^{j\beta}/n^{\rho+1+2\alpha}$ and $\varphi_{i,m} \ge n^{(j-1)\beta}/n^{\rho+1+\alpha}$ leads to the second term. Note that as before, $m$ is used here in place of $k$, because using an effective alphabet $m$, all greater symbols are packed together as one symbol, and the additional cost to describe them is reflected in an additional term. Adding this additional term with an identical expression to that in the lower regions, absorbing low order terms in $\varepsilon$, and normalizing by $n$, yields the third region of the bound in (81). Since the bound holds for every $\alpha$ and every $\rho > \alpha$, it can be optimized to give the values that attain the minimum, concluding the proof of Theorem 8. □



7 Summary and Conclusions 

Universal compression of sequences generated by monotonic distributions was studied. We showed that for finite alphabets, if one has the prior knowledge of the monotonicity of a distribution, one can reduce the cost of universality. For alphabets of $o(n^{1/3})$ letters, this cost reduces from $0.5\log(n/k)$ bits per each unknown probability parameter to $0.5\log(n/k^3)$ bits per each unknown probability parameter. Otherwise, for alphabets of $O(n)$ letters, one can compress such sources with overall redundancy of $O(n^{1/3+\varepsilon})$ bits. This is a significant decrease in redundancy from $O(k\log n)$ or $O(n)$ bits overall that can be achieved if no side information is available about the source distribution. Redundancy of $O(n^{1/3+\varepsilon})$ bits overall can also be achieved for much larger alphabets, including infinite alphabets, for fast decaying monotonic distributions. Sequences generated by slower decaying distributions can also be compressed with diminishing per-symbol redundancy costs under some mild conditions, and specifically if they have finite entropy rates. Examples for well-known monotonic distributions demonstrated how the diminishing redundancy decay rates can be computed by applying the bounds that were derived. Finally, the average case results were extended to individual sequences. Similar convergence rates were shown for sequences that have empirical monotonic distributions. Furthermore, universal redundancy bounds w.r.t. the monotonic ML description length of a sequence were also derived for the more general case. Under some mild conditions, these bounds still exhibit diminishing per-symbol redundancies.

Appendix A - Proof of Theorem 1

The proof follows the same steps used in [25] and [26] to lower bound the maximin redundancies for large alphabets and patterns, respectively, using the weak version of the redundancy-capacity theorem [5]. This version ties the maximin universal coding redundancy to the capacity of a channel defined by the conditional probability $P_\theta(x^n)$. We define a set $\Omega_{\mathcal{M}_k}$ of points $\theta \in \mathcal{M}_k$. Then, we show that these points are distinguishable by observing $X^n$, i.e., the probability that $X^n$ generated by $\theta \in \Omega_{\mathcal{M}_k}$ appears to have been generated by another point $\theta' \in \Omega_{\mathcal{M}_k}$ diminishes with $n$. Then, using Fano's inequality [3], the logarithm of the number of such distinguishable points, normalized by $n$, is a lower bound on $R_n^-(\mathcal{M}_k)$. Since $R_n^+(\mathcal{M}_k) \ge R_n^-(\mathcal{M}_k)$, it is also a lower bound on the average minimax redundancy. The two regions in (6) result from a threshold phenomenon, where there exists a value $k_m$ of $k$ that maximizes the lower bound, and can be applied to all $k \ge k_m$.

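We begin with defining $\Omega_{\mathcal{M}_k}$. Let $\omega$ be a vector of grid components, such that the last $k - 1$ components $\theta_i$, $i = 2, \ldots, k$, of $\theta \in \Omega_{\mathcal{M}_k}$ must satisfy $\theta_i \in \omega$. Let $\omega_b$ be the $b$th point in $\omega$, and define $\omega_0 = 0$ and

$$\omega_b \triangleq \frac{b^2}{n^{1-\varepsilon}}, \quad b = 1, 2, \ldots. \tag{A.1}$$

Then, for the $b$th point in $\omega$,

$$\Delta_b \triangleq \omega_b - \omega_{b-1} = \frac{2b - 1}{n^{1-\varepsilon}} \ge \frac{\sqrt{\omega_b}}{\sqrt{n^{1-\varepsilon}}}. \tag{A.2}$$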



To count the number of points in $\Omega_{\mathcal{M}_k}$, let us first consider the standard i.i.d. case, where there is no monotonicity requirement, and count the number of points in $\Omega$, which is defined similarly, but without the monotonicity requirement (i.e., $\Omega_{\mathcal{M}_k} \subseteq \Omega$). Let $b_i$ be the index of $\theta_i$ in $\omega$, i.e., $\theta_i = \omega_{b_i}$. Then, from (A.1)-(A.2) and since the components of $\theta$ are probabilities,

$$\sum_{i=2}^{k}\theta_i = \sum_{i=2}^{k}\frac{b_i^2}{n^{1-\varepsilon}} \le 1. \tag{A.3}$$

It follows that for $\theta \in \Omega$,

$$\sum_{i=2}^{k} b_i^2 \le n^{1-\varepsilon}. \tag{A.4}$$



j=2 



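Hence, since the components $b_i$ are nonnegative integers, the number $M$ of points in $\Omega$ satisfies

$$M \ge \sum_{b_2=0}^{\left\lfloor\sqrt{n^{1-\varepsilon}}\right\rfloor}\;\sum_{b_3=0}^{\left\lfloor\sqrt{n^{1-\varepsilon}-b_2^2}\right\rfloor}\cdots\sum_{b_k=0}^{\left\lfloor\sqrt{n^{1-\varepsilon}-\sum_{i=2}^{k-1}b_i^2}\right\rfloor} 1 \overset{(a)}{\ge} \int_0^{\sqrt{n^{1-\varepsilon}}}\int_0^{\sqrt{n^{1-\varepsilon}-x_2^2}}\cdots\int_0^{\sqrt{n^{1-\varepsilon}-\sum_{i=2}^{k-1}x_i^2}} dx_k\cdots dx_3\,dx_2 \overset{(b)}{=} \frac{V_{k-1}\left(\sqrt{n^{1-\varepsilon}}\right)}{2^{k-1}}, \tag{A.5}$$

where $V_{k-1}\left(\sqrt{n^{1-\varepsilon}}\right)$ is the volume of a $k - 1$ dimensional sphere with radius $\sqrt{n^{1-\varepsilon}}$, (a) follows from monotonic decrease of the function in the integrand for all integration arguments, and (b) follows since its left hand side computes the volume of the positive quadrant of this sphere. Note that this is a different proof from that used in [25]-[26] for this step. Applying the monotonicity constraint, all permutations of $\theta$ that are not monotonic must be taken out of the grid. Hence,

$$M_{\mathcal{M}_k} \ge \frac{V_{k-1}\left(\sqrt{n^{1-\varepsilon}}\right)}{k!\cdot 2^{k-1}}, \tag{A.6}$$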
where dividing by $k!$ is a worst case assumption, yielding a lower bound and not an equality. This leads to a lower bound equal to that obtained for patterns in [26] on the number of points in $\Omega_{\mathcal{M}_k}$. Specifically, the bound achieves a maximal value for $k_m = \left(\pi n^{1-\varepsilon}/2\right)^{1/3}$ and then decreases to eventually become smaller than 1. However, for $k > k_m$, one can consider a monotonic distribution for which all components $\theta_i$, $i > k_m$, of $\theta$ are zero, and use the bound for $k_m$.
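
The threshold value $k_m$ can be checked numerically. The following minimal sketch evaluates the logarithm of a bound of the form $V_{k-1}(\sqrt{n^{1-\varepsilon}})/(k!\,2^{k-1})$, using the standard volume formula $V_d(r) = \pi^{d/2}r^d/\Gamma(d/2+1)$ for a $d$-dimensional ball, and locates its maximizer over $k$. The function names and the search range `k_max` are hypothetical choices made for illustration.

```python
import math


def log2_grid_bound(k, n, eps=0.0):
    """log2 of V_{k-1}(sqrt(n^(1-eps))) / (k! * 2^(k-1)).

    Uses V_d(r) = pi^(d/2) * r^d / Gamma(d/2 + 1) and works in the log
    domain (math.lgamma) to avoid overflow.
    """
    d = k - 1
    log_r = 0.5 * (1 - eps) * math.log(n)
    log_volume = 0.5 * d * math.log(math.pi) + d * log_r - math.lgamma(d / 2 + 1)
    log_bound = log_volume - math.lgamma(k + 1) - d * math.log(2)
    return log_bound / math.log(2)


def best_k(n, eps=0.0, k_max=1000):
    """Integer k in [2, k_max] that maximizes the grid bound."""
    return max(range(2, k_max + 1), key=lambda k: log2_grid_bound(k, n, eps))


if __name__ == "__main__":
    n = 10 ** 6
    k_star = best_k(n)
    k_m = (math.pi * n / 2) ** (1 / 3)
    print(f"numerical maximizer: {k_star}, closed form k_m: {k_m:.2f}")
    print(f"log2 of the bound at the maximizer: {log2_grid_bound(k_star, n):.1f} bits")
```

Under these assumptions the numerical maximizer falls within one unit of the closed form $(\pi n^{1-\varepsilon}/2)^{1/3}$, and the bound decreases for larger $k$, which is why the bound for $k_m$ is reused when $k > k_m$.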

Distinguishability of $\theta \in \Omega_{\mathcal{M}_k}$ is a direct result of distinguishability of $\theta \in \Omega$, which is shown in Lemma 3.1 in [25], i.e., there exists an estimator $\hat{\Theta}_g(X^n) \in \Omega$ for which the estimate $\hat{\theta}_g$ satisfies $\lim_{n\to\infty} P_\theta(\hat{\theta}_g \ne \theta) = 0$ for all $\theta \in \Omega$. Since this is true for all points in $\Omega$, it is also true for all points in $\Omega_{\mathcal{M}_k} \subseteq \Omega$, where now, $\hat{\theta}_g \in \Omega_{\mathcal{M}_k}$. Assuming all points in $\Omega_{\mathcal{M}_k}$ are equally probable to generate $X^n$, we can define an average error probability $P_e \triangleq \Pr\left[\hat{\Theta}_g(X^n) \ne \Theta\right] = \sum_{\theta\in\Omega_{\mathcal{M}_k}} P_\theta(\hat{\theta}_g \ne \theta)/M_{\mathcal{M}_k}$. Using the redundancy-capacity theorem,

$$n R_n^-[\mathcal{M}_k] \ge C[\mathcal{M}_k \to X^n] \overset{(a)}{\ge} I[\Theta; X^n] = H[\Theta] - H[\Theta \mid X^n] \overset{(b)}{=} \log M_{\mathcal{M}_k} - H[\Theta \mid X^n] \overset{(c)}{\ge} (1 - P_e)(\log M_{\mathcal{M}_k}) - 1 \overset{(d)}{\ge} (1 - o(1))\log M_{\mathcal{M}_k}, \tag{A.7}$$

where $C[\mathcal{M}_k \to X^n]$ denotes the capacity of the respective channel and $I[\Theta; X^n]$ is the mutual information induced by the joint distribution $\Pr(\Theta = \theta)\cdot P_\theta(X^n)$. Inequality (a) follows from the definition of capacity, equality (b) from the uniform distribution of $\Theta$ in $\Omega_{\mathcal{M}_k}$, inequality (c) from Fano's inequality, and (d) follows since $P_e \to 0$. Lower bounding the expression in (A.6) for the two regions (obtaining the same bounds as in [26]), then using (A.7), normalizing by $n$, and absorbing low order terms in $\varepsilon$, yields the two regions of the bound in (6). The proof of Theorem 1 is concluded. □



Appendix B - Proof of Theorem 2

To prove Theorem 2, we use the random-coding strong version of the redundancy-capacity theorem [17]. The idea is similar to the weak version used in Appendix A. We assume that grids $\Omega_{\mathcal{M}_k}$ of points are uniformly distributed over $\mathcal{M}_k$, and one grid is selected randomly. Then, a point in the selected grid is randomly selected under a uniform prior to generate $X^n$. Showing distinguishability within a selected grid, for every possible random choice of $\Omega_{\mathcal{M}_k}$, implies that a lower bound on the cardinality of $\Omega_{\mathcal{M}_k}$ for every possible choice is essentially a lower bound on the overall sequence redundancy for most sources in $\mathcal{M}_k$.

The construction of $\Omega_{\mathcal{M}_k}$ is identical to that used in [26] to construct a grid of sources that generate patterns. We pack spheres of radius $n^{-0.5(1-\varepsilon)}$ in the parameter space defining $\mathcal{M}_k$. The set $\Omega_{\mathcal{M}_k}$ consists of the center points of the spheres. To cover the space $\mathcal{M}_k$, we randomly select a random shift of the whole lattice under a uniform distribution. The cardinality of $\Omega_{\mathcal{M}_k}$ is lower bounded by the relation between the volume of $\mathcal{M}_k$, which equals (as shown in [26]) $1/[(k-1)!\,k!]$, and the volume of a single sphere, with factoring also of a packing density (see, e.g., [2]). This yields eq. (55) in [26],

$$M_{\mathcal{M}_k} \ge \frac{1}{(k-1)!\cdot k!\cdot V_{k-1}\left(n^{-0.5(1-\varepsilon)}\right)\cdot 2^{k-1}}, \tag{B.1}$$

where $V_{k-1}\left(n^{-0.5(1-\varepsilon)}\right)$ is the volume of a $k - 1$ dimensional sphere with radius $n^{-0.5(1-\varepsilon)}$ (see, e.g., [2] for computation of this volume).

For distinguishability, it is sufficient to show that there exists an estimator $\hat{\Theta}_g(X^n) \in \Omega_{\mathcal{M}_k}$ such that $\lim_{n\to\infty} P_\theta\left[\hat{\Theta}_g(X^n) \ne \theta\right] = 0$ for every choice of $\Omega_{\mathcal{M}_k}$ and for every choice of $\theta \in \Omega_{\mathcal{M}_k}$. This is already shown in Lemma 4.1 in [25] for a larger grid $\Omega$ of i.i.d. sources, which is constructed identically to $\Omega_{\mathcal{M}_k}$ over the complete $k - 1$ dimensional probability simplex. Therefore, by the monotonicity requirement, for every $\Omega_{\mathcal{M}_k}$, there exists such $\Omega$, such that $\Omega_{\mathcal{M}_k} \subseteq \Omega$. Since Lemma 4.1 in [25] holds for $\Omega$, it then must also hold for the smaller grid $\Omega_{\mathcal{M}_k}$. Note that distinguishability is easier to prove here than for patterns because $\hat{\Theta}_g(X^n)$ is obtained directly from $X^n$ and not from its pattern as in [26]. Now, since all the conditions of the strong random-coding version of the redundancy-capacity theorem hold, taking the logarithm of the bound in (B.1), absorbing low order terms in $\varepsilon$, and normalizing by $n$, leads to the first region of the bound in (7). More detailed steps follow those found in [26].

The second region of the bound is handled in a manner related to the second region of the bound of Theorem 1. However, here, we cannot simply set the probability of all symbols $i > k_m$ to zero, because all possible valid sources must be included in one of the grids $\Omega_{\mathcal{M}_k}$ to generate a complete covering of $\mathcal{M}_k$. As was done in [26], we include sources with $\theta_i > 0$ for $i > k_m$ in the grids $\Omega_{\mathcal{M}_k}$, but do not include them in the lower bound on the number of grid points. Instead, for $k > k_m$, we bound the number of points in a $k_m$-dimensional cut of $\mathcal{M}_k$ for which the remaining $k - k_m$ components of $\theta$ are very small (and insignificant). This analysis is valid also for $k > n$. Distinguishability for $k > k_m$ is shown for i.i.d. non-monotonically restricted distributions in the proof of Lemma 6.1 in [26]. As before, it carries over to monotonic distributions, since, for each $\Omega_{\mathcal{M}_k}$, there exists an unrestricted corresponding $\Omega$, such that $\Omega_{\mathcal{M}_k} \subseteq \Omega$. The choice of $k_m = 0.5\left(n^{1-\varepsilon}/\pi\right)^{1/3}$ gives the maximal bound w.r.t. $k$. Since, again, all conditions of the strong version of the redundancy-capacity theorem are satisfied, the second region of the bound is obtained. Again, more detailed steps can be found in [26]. This concludes the proof of Theorem 2. □



Appendix C - Proof of Lemma 6.1



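For cardinality $k$, we consider the largest component of $\hat{\theta}_{\mathcal{M}}$, $\hat{\theta}_{1,\mathcal{M}}$, as the constraint component, i.e., $\hat{\theta}_{1,\mathcal{M}} = 1 - \sum_{i=2}^{k}\hat{\theta}_{i,\mathcal{M}}$. For any given probability parameter $\varphi$ of cardinality $k$ with $\varphi_1 > 0$, we have

$$P_\varphi(x^n) = \varphi_1^{n_x(1)}\prod_{i=2}^{k}\varphi_i^{n_x(i)} = \varphi_1^{n_x(1)}(1 - \varphi_1)^{n - n_x(1)}\prod_{i=2}^{k}\left(\frac{\varphi_i}{1 - \varphi_1}\right)^{n_x(i)}, \tag{C.1}$$

where we recall that $n_x(i)$ is the occurrence count of $i$ in $x^n$. Therefore, maximization of $P_\varphi(x^n)$ w.r.t. $\varphi_1$ is independent of the maximization over $\varphi_i$, $i > 1$, and is obtained for $\varphi_1 = \hat{\theta}_1 = n_x(1)/n$. Since for all $i > 1$, $\hat{\theta}_{1,\mathcal{M}} \ge \hat{\theta}_{i,\mathcal{M}}$, $\hat{\theta}_{1,\mathcal{M}}$ can thus only increase from $\hat{\theta}_1$ by the monotonicity constraint. (Note that the monotonicity constraint implies a water filling [3] optimization to achieve $\hat{\theta}_{\mathcal{M}}$.) Hence, $\hat{\theta}_{1,\mathcal{M}} \ge n_x(1)/n$.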

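Now, using the result above, we show that the derivative of $\ln P_{\varphi_{\mathcal{M}}}(x^n)$ w.r.t. $\varphi_{k,\mathcal{M}}$ is positive for $\varphi_{k,\mathcal{M}} < 1/(kn)$ and a monotonic $\varphi_{\mathcal{M}}$. A component of a parameter vector $\varphi_{\mathcal{M}}$, which is monotonic, can be expressed as

$$\varphi_{i,\mathcal{M}} = \sum_{j=i}^{k}\varphi'_j, \quad \varphi'_j \ge 0. \tag{C.2}$$

Hence,

$$\frac{\partial\ln P_{\varphi_{\mathcal{M}}}(x^n)}{\partial\varphi_{k,\mathcal{M}}} \overset{(a)}{=} \frac{\partial\ln P_{\varphi_{\mathcal{M}}}(x^n)}{\partial\varphi'_k} \overset{(b)}{=} \sum_{i=2}^{k}\frac{n_x(i)}{\varphi_{i,\mathcal{M}}} - \frac{(k-1)\,n_x(1)}{\varphi_{1,\mathcal{M}}} \overset{(c)}{>} \frac{k\,n_x(k)}{\hat{\theta}_k} - \frac{k\,n_x(1)}{\hat{\theta}_1} \overset{(d)}{=} 0, \tag{C.3}$$

where (a) follows from $\varphi_{k,\mathcal{M}}$ being the smallest nonzero component of $\varphi_{\mathcal{M}}$, (b) is since by (C.2), $\varphi'_k$ is included in all terms, and

$$\varphi_{1,\mathcal{M}} = 1 - \sum_{i=2}^{k}\varphi_{i,\mathcal{M}} = 1 - \sum_{i=2}^{k-1}\sum_{j=i}^{k-1}\varphi'_j - (k-1)\,\varphi_{k,\mathcal{M}}, \tag{C.4}$$

where the last equality follows from (C.2). (c) follows by omitting all terms of the sum except $i = k$, from the assumption that $\varphi_{k,\mathcal{M}} < 1/(nk) \le \hat{\theta}_k/k$, and since $\hat{\theta}_{1,\mathcal{M}} \ge n_x(1)/n = \hat{\theta}_1$, and (d) follows since its left hand side is for the (i.i.d.) ML parameter values. Hence, $P_{\varphi_{\mathcal{M}}}(x^n)$ must increase, with $\varphi_{1,\mathcal{M}}$ taking its optimal value, for all $\varphi_{\mathcal{M}}$ for which $\varphi_{k,\mathcal{M}} < 1/(nk)$, and the maximum is thus achieved for $\hat{\theta}_{k,\mathcal{M}} \ge 1/(nk)$. □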